Commit

Deploying to gh-pages from @ a30c248 🚀
camilavargasp committed Oct 8, 2024
1 parent 4267b48 commit 724edd2
Showing 6 changed files with 30 additions and 20 deletions.
2 changes: 1 addition & 1 deletion 2024-10-coreR/search.json
@@ -744,7 +744,7 @@
"href": "session_09.html#introduction",
"title": "9  Cleaning and Wrangling Data",
"section": "9.1 Introduction",
-"text": "9.1 Introduction\nThe data we get to work with are rarely, if ever, in the format we need to do our analyses. It’s often the case that one package requires data in one format, while another package requires the data to be in another format. To be efficient analysts, we should have good tools for reformatting data for our needs so we can do further work like making plots and fitting models. The dplyr and tidyr R packages provide a fairly complete and extremely powerful set of functions for us to do this reformatting quickly. Learning these tools well will greatly increase your efficiency as an analyst.\nLet’s look at two motivating examples.\n\n\n\n\n\n\nExample 1\n\n\n\nSuppose you have the following data.frame called length_data with data about salmon length and want to calculate the average length per year.\n\n\n\nyear\nlength_cm\n\n\n\n\n1990\n5.673318\n\n\n1991\n3.081224\n\n\n1991\n4.592696\n\n\n1992\n4.381523\n\n\n1992\n5.597777\n\n\n1992\n4.900052\n\n\n\nThe dplyr R library provides a fast and powerful way to do this calculation in a few lines of code:\n\nlength_data %>% \n group_by(year) %>% \n summarize(mean_length_cm = mean(length_cm))\n\n\n\n\n\n\n\n\n\nExample 2\n\n\n\nAnother process we often need to do is to “reshape” our data. Consider the following table that is in what we call “wide” format:\n\n\n\nsite\n1990\n1991\n…\n1993\n\n\n\n\ngold\n100\n118\n…\n112\n\n\nlake\n100\n118\n…\n112\n\n\n…\n…\n…\n…\n…\n\n\ndredge\n100\n118\n…\n112\n\n\n\nYou are probably familiar with data in the above format, where values of the variable being observed are spread out across columns. In this example we have a different column per year. This wide format works well for data entry and sometimes works well for analysis but we quickly outgrow it when using R (and know it is not tidy data!). For example, how would you fit a model with year as a predictor variable? In an ideal world, we’d be able to just run lm(length ~ year). But this won’t work on our wide data because lm() needs length and year to be columns in our table.\nThe tidyr package allows us to quickly switch between wide format and long format using the pivot_longer() function:\n\nsite_data %>% \n pivot_longer(-site, names_to = \"year\", values_to = \"length\")\n\n\n\n\nsite\nyear\nlength\n\n\n\n\ngold\n1990\n101\n\n\nlake\n1990\n104\n\n\ndredge\n1990\n144\n\n\n…\n…\n…\n\n\ndredge\n1993\n145\n\n\n\n\n\nThis lesson will cover examples to learn about the functions you’ll most commonly use from the dplyr and tidyr packages:\n\nCommon dplyr functions\n\n\n\n\n\n\nFunction name\nDescription\n\n\n\n\nmutate()\nCreates modify and deletes columns\n\n\ngroup_by()\nGroups data by one or more variables\n\n\nsummarise()\nSummaries each group down to one row\n\n\nselect()\nKeep or drop columns using their names\n\n\nfilter()\nKeeps rows that matches conditions\n\n\narrange()\norder rows using columns variable\n\n\nrename()\nRename a column\n\n\n\n\nCommon tidyr functions\n\n\n\n\n\n\nFunction name\nDescription\n\n\n\n\npivot_longer()\ntransforms data from a wide to a long format\n\n\npivot_wider()\ntransforms data from a long to a wide format\n\n\nunite()\nUnite multiple columns into one by pasting strings together\n\n\nseparate()\nSeparate a character column into multiple columns with a regular expression or numeric locations",
+"text": "9.1 Introduction\nThe data we get to work with are rarely, if ever, in the format we need to do our analyses. It’s often the case that one package requires data in one format, while another package requires the data to be in another format. To be efficient analysts, we should have good tools for reformatting data for our needs so we can do further work like making plots and fitting models. The dplyr and tidyr R packages provide a fairly complete and extremely powerful set of functions for us to do this reformatting quickly. Learning these tools well will greatly increase your efficiency as an analyst.\nLet’s look at two motivating examples.\n\n\n\n\n\n\nExample 1\n\n\n\nSuppose you have the following data.frame called length_data with data about salmon length and want to calculate the average length per year.\n\n\n\nyear\nlength_cm\n\n\n\n\n1990\n5.673318\n\n\n1991\n3.081224\n\n\n1991\n4.592696\n\n\n1992\n4.381523\n\n\n1992\n5.597777\n\n\n1992\n4.900052\n\n\n\nBefore thinking about the code, let’s think about the steps we need to take to get to the answer (aka pseudocode).\nNow, how would we code this? The dplyr R library provides a fast and powerful way to do this calculation in a few lines of code:\n\n\nAnswer\nlength_data %>% \n group_by(year) %>% \n summarize(mean_length_cm = mean(length_cm))\n\n\n\n\n\n\n\n\n\n\nExample 2\n\n\n\nAnother process we often need to do is to “reshape” our data. Consider the following table that is in what we call “wide” format:\n\n\n\nsite\n1990\n1991\n…\n1993\n\n\n\n\ngold\n100\n118\n…\n112\n\n\nlake\n100\n118\n…\n112\n\n\n…\n…\n…\n…\n…\n\n\ndredge\n100\n118\n…\n112\n\n\n\nYou are probably familiar with data in the above format, where values of the variable being observed are spread out across columns. In this example we have a different column per year. This wide format works well for data entry and sometimes works well for analysis but we quickly outgrow it when using R (and know it is not tidy data!). For example, how would you fit a model with year as a predictor variable? In an ideal world, we’d be able to just run lm(length ~ year). But this won’t work on our wide data because lm() needs length and year to be columns in our table.\nWhat steps would you take to get this data frame in a long format?\nThe tidyr package allows us to quickly switch between wide format and long format using the pivot_longer() function:\n\n\nAnswer\nsite_data %>% \n pivot_longer(-site, \n names_to = \"year\", \n values_to = \"length\")\n\n\n\n\n\nsite\nyear\nlength\n\n\n\n\ngold\n1990\n101\n\n\nlake\n1990\n104\n\n\ndredge\n1990\n144\n\n\n…\n…\n…\n\n\ndredge\n1993\n145\n\n\n\n\n\nThis lesson will cover examples to learn about the functions you’ll most commonly use from the dplyr and tidyr packages:\n\nCommon dplyr functions\n\n\n\n\n\n\nFunction name\nDescription\n\n\n\n\nmutate()\nCreates modify and deletes columns\n\n\ngroup_by()\nGroups data by one or more variables\n\n\nsummarise()\nSummaries each group down to one row\n\n\nselect()\nKeep or drop columns using their names\n\n\nfilter()\nKeeps rows that matches conditions\n\n\narrange()\norder rows using columns variable\n\n\nrename()\nRename a column\n\n\n\n\nCommon tidyr functions\n\n\n\n\n\n\nFunction name\nDescription\n\n\n\n\npivot_longer()\ntransforms data from a wide to a long format\n\n\npivot_wider()\ntransforms data from a long to a wide format\n\n\nunite()\nUnite multiple columns into one by pasting strings together\n\n\nseparate()\nSeparate a character column into multiple columns with a regular expression or numeric locations",
"crumbs": [
"<span class='chapter-number'>9</span>  <span class='chapter-title'>Cleaning and Wrangling Data</span>"
]
16 changes: 13 additions & 3 deletions 2024-10-coreR/session_09.html
@@ -394,11 +394,15 @@ <h2 data-number="9.1" class="anchored" data-anchor-id="introduction"><span class
</tr>
</tbody>
</table>
-<p>The <code>dplyr</code> R library provides a fast and powerful way to do this calculation in a few lines of code:</p>
+<p>Before thinking about the code, let’s think about the steps we need to take to get to the answer (aka pseudocode).</p>
+<p>Now, how would we code this? The <code>dplyr</code> R library provides a fast and powerful way to do this calculation in a few lines of code:</p>
<div class="cell">
+<details class="code-fold">
+<summary>Answer</summary>
<div class="sourceCode cell-code" id="cb1"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a>length_data <span class="sc">%&gt;%</span> </span>
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a> <span class="fu">group_by</span>(year) <span class="sc">%&gt;%</span> </span>
<span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a> <span class="fu">summarize</span>(<span class="at">mean_length_cm =</span> <span class="fu">mean</span>(length_cm))</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
+</details>
</div>
</div>
</div>
@@ -455,10 +459,16 @@ <h2 data-number="9.1" class="anchored" data-anchor-id="introduction"><span class
</tbody>
</table>
<p>You are probably familiar with data in the above format, where values of the variable being observed are spread out across columns. In this example we have a different column per year. This wide format works well for data entry and sometimes works well for analysis but we quickly outgrow it when using R (and know it is not tidy data!). For example, how would you fit a model with year as a predictor variable? In an ideal world, we’d be able to just run <code>lm(length ~ year)</code>. But this won’t work on our wide data because <code>lm()</code> needs <code>length</code> and <code>year</code> to be columns in our table.</p>
+<p>What steps would you take to get this data frame in a long format?</p>
<p>The <code>tidyr</code> package allows us to quickly switch between wide format and long format using the <code>pivot_longer()</code> function:</p>
<div class="cell">
+<details class="code-fold">
+<summary>Answer</summary>
<div class="sourceCode cell-code" id="cb2"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a>site_data <span class="sc">%&gt;%</span> </span>
-<span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a> <span class="fu">pivot_longer</span>(<span class="sc">-</span>site, <span class="at">names_to =</span> <span class="st">"year"</span>, <span class="at">values_to =</span> <span class="st">"length"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
+<span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a> <span class="fu">pivot_longer</span>(<span class="sc">-</span>site, </span>
+<span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a> <span class="at">names_to =</span> <span class="st">"year"</span>, </span>
+<span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a> <span class="at">values_to =</span> <span class="st">"length"</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
+</details>
</div>
<table class="caption-top table">
<thead>
@@ -696,7 +706,7 @@ <h2 data-number="9.2" class="anchored" data-anchor-id="data-cleaning-basics"><sp
</section>
<section id="data-exploration" class="level2" data-number="9.3">
<h2 data-number="9.3" class="anchored" data-anchor-id="data-exploration"><span class="header-section-number">9.3</span> Data exploration</h2>
-<p>Similar to what we did in our <a href="https://learning.nceas.ucsb.edu/2024-06-delta/session_04.html">Literate Analysis</a> lesson, it is good practice to skim through the data you just read in.</p>
+<p>Similar to what we did in our <a href="https://learning.nceas.ucsb.edu/2024-10-coreR/session_05.html">Literate Analysis</a> lesson, it is good practice to skim through the data you just read in.</p>
<p>Doing so is important to make sure the data is read as you were expecting and to familiarize yourself with the data.</p>
<p>Some of the basic ways to explore your data are:</p>
<div class="cell">
0 comments on commit 724edd2
