forked from poole/lanyon
-
Notifications
You must be signed in to change notification settings - Fork 5
/
03-data-frames-2.Rmd
345 lines (258 loc) · 12.3 KB
/
03-data-frames-2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
---
layout: page
title: 03 -- `data.frame` continued
---
# Getting back where we started
Let's start with a clean working directory. Create the folder `data/` within
your working directory, and download the 2 datasets:
```{r, eval=FALSE}
download.file("http://r-bio.github.io/data/surveys.csv",
"data/surveys.csv")
download.file("http://r-bio.github.io/data/species.csv",
"data/species.csv")
```
and then check that you can load the surveys dataset (for now) into R:
```{r, echo=FALSE}
surveys <- read.csv(file="data/surveys.csv", stringsAsFactors=FALSE)
head(surveys)
str(surveys)
```
# About the `data.frame` class
```{r, echo=FALSE, purl=TRUE}
## The data.frame class
```
`data.frame` is the _de facto_ data structure for most tabular data and what we
use for statistics and plotting.
A `data.frame` is a collection of vectors of identical lengths. Each vector
represents a column, and each vector can be of a different class (e.g.,
characters, integers, factors). The `str()` function is useful to inspect the
data types of the columns.
The most common way you are going to create `data.frame` objects is when you
will use the functions `read.csv()` or `read.table()`, in other words, when
importing spreadsheets from your hard drive (or the web).
You can also create `data.frame` manually with the function `data.frame()`. This
function can also take the argument `stringsAsFactors`. Compare the output of
these examples:
```{r, results='show', purl=TRUE}
example_data <- data.frame(animal=c("dog", "cat", "sea cucumber", "sea urchin"),
feel=c("furry", "furry", "squishy", "spiny"),
weight=c(45, 8, 1.1, 0.8))
str(example_data)
```
Here you can observe the default behavior of the `data.frame` function. The
columns `animal` and `feel` are of class `factor`. By default, `data.frame`
converts (= coerces) columns that contain characters (i.e., text) into a vector
of class `factor`. Depending on what you want to do with the data, you may want
to keep these columns as `character`. To do so, `read.csv()` and `read.table()`
have an argument called `stringsAsFactors` which can be set to `FALSE`:
```{r, results='show', purl=TRUE}
example_data <- data.frame(animal=c("dog", "cat", "sea cucumber", "sea urchin"),
feel=c("furry", "furry", "squishy", "spiny"),
weight=c(45, 8, 1.1, 0.8), stringsAsFactors=FALSE)
str(example_data)
```
If you want to manually change the class of one of the column, you can use the
function `as.factor()` (below we'll cover in more detail how to work with
columns):
```{r, results='show', purl=TRUE}
example_data$feel <- as.factor(example_data$feel)
str(example_data)
```
### Challenge
1. There are a few mistakes in this hand crafted `data.frame`, can you spot and
fix them? Don't hesitate to experiment!
```{r, eval=FALSE, purl=TRUE}
author_book <- data.frame(author_first=c("Charles", "Ernst", "Theodosius"),
author_last=c(Darwin, Mayr, Dobzhansky),
year=c(1942, 1970))
```
2. Can you predict the class for each of the columns in the following example?
Check your guesses using `str(country_climate)`. Are they what you expected?
Why? why not?
```{r, purl=TRUE}
country_climate <- data.frame(country=c("Canada", "Panama", "South Africa", "Australia"),
climate=c("cold", "hot", "temperate", "hot/temperate"),
temperature=c(10, 30, 18, "15"),
north_hemisphere=c(TRUE, TRUE, FALSE, "FALSE"),
has_kangaroo=c(FALSE, FALSE, FALSE, 1))
```
Check your guesses using `str(country_climate)`. Are they what you expected?
Why? why not?
R coerces (when possible) to the data type that is the least common
denominator and the easiest to coerce to. You can review the notes from the
second lecture to review the coercion rules R uses.
# Inspecting `data.frame` objects
We already saw how the functions `head()` and `str()` can be useful to check the
content and the structure of a `data.frame`. Here is a non-exhaustive list of
functions to get a sense of the content/structure of the data.
* Size:
* `dim()` - returns a vector with the number of rows in the first element, and
the number of columns as the second element (the __dim__ensions of the object)
* `nrow()` - returns the number of rows
* `ncol()` - returns the number of columns
* Content:
* `head()` - shows the first 6 rows
* `tail()` - shows the last 6 rows
* Names:
* `names()` - returns the column names (synonym of `colnames()` for `data.frame`
objects)
* `rownames()` - returns the row names
* Summary:
* `str()` - structure of the object and information about the class, length and
content of each column
* `summary()` - summary statistics for each column
Note: most of these functions are "generic", they can be used on other types of
objects besides `data.frame`.
### Challenge
Use these functions on the `surveys` data set loaded in R.
# Slicing data
Our survey data frame has rows and columns (it's a 2-dimensional object), if we
want to extract some specific data from it (a slice of it), we need to specify
the "coordinates" we want the data to come from. To do this, we use the square
bracket notation (just like with vectors), except that we need to add a comma to
indicate the rows and columns we want. Row numbers come first, followed by
column numbers. Here are some examples:
```{r, results='hide', purl=FALSE}
surveys[1, 1] # first element in the first column of the data frame
surveys[1, 6] # first element in the 6th column
surveys[1:3, 7] # first three elements in the 7th column
surveys[3, ] # the 3rd element for all columns
surveys[, 8] # the entire 8th column
head_surveys <- surveys[1:6, ] # surveys[1:6, ] is equivalent to head(surveys)
```
### Challenge
1. The function `nrow()` on a `data.frame` returns the number of rows. Use it,
in conjuction with `seq()` to create a new `data.frame` called
`surveys_by_10` that includes every 10th row of the survey data frame
starting at row 10 (10, 20, 30, ...)
# Subsetting data
```{r, echo=FALSE, purl=TRUE}
## subsetting data
```
In particular for larger datasets, it can be tricky to remember the column
number that corresponds to a particular variable (Are species names in column 6
or 8? oh, right... they are in column 7), and using the column number to extract
the data (i.e., `surveys[, 7]`) may not be practical. In some cases, in which
column the variable will be can change if the script you are using adds or
removes columns. It's therefore often better to use column names to refer to a
particular variable, and it makes your code easier to read and your intentions
clearer.
You can do operations on a particular column, by selecting it using the `$`
sign. In this case, the entire column is a vector. For instance, to extract all
the weights from our datasets, we can use: `surveys$wgt`. You can use
`names(surveys)` or `colnames(surveys)` to remind yourself of the column names.
In some cases, you may way to select more than one column. You can do this using
the square brackets: `surveys[, c("wgt", "sex")]`.
When analyzing data, though, we often want to look at partial statistics, such
as the maximum value of a variable per species or the average value per plot.
One way to do this is to select the data we want, and create a new temporary
array, using the `subset()` function. For instance, if we just want to look at
the animals of the species "DO":
```{r, purl=FALSE}
surveys_DO <- subset(surveys, species == "DO")
```
### Challenge
1. What does the following do (Try to guess without executing it)?
`surveys_DO$month[2] <- 8`
1. Use the function `subset` to create a `data.frame` that contains all
individuals of the species "DM" that were collected in 2002. How many
individuals of the species "DM" were collected in 2002?
# Adding a column to our dataset
```{r, echo=FALSE, purl=TRUE}
## Adding columns
```
Sometimes, you may have to add a new column to your dataset that represents a
new variable. You can add columns to a `data.frame` using the function `cbind()`
(__c__olumn __bind__). Beware, the additional column must have the same number
of elements as there are rows in the `data.frame`.
In our survey dataset, the species are represented by a 2-letter code (e.g.,
"AB"), however, we would like to include the species name. The correspondance
between the 2 letter code and the names are in the file `species.csv`. In this
file, one column includes the genus and another includes the species. First, we
are going to import this file in memory:
```{r, purl=TRUE}
species <- read.csv("data/species.csv", stringsAsFactors=FALSE)
```
We are then going to use the function `match()` to create a vector that contains
the genus names for all our observations. The function `match()` takes at least
2 arguments: the values to be matched (in our case the 2 letter code from the
`surveys` data frame held in the column `species`), and the table that contains
the values to be matched against (in our case the 2 letter code in the `species`
data frame held in the column `species_id`). The function returns the position
of the matches in the table, and this can be used to retrieve the genus names:
```{r, purl=TRUE}
surveys_spid_index <- match(surveys$species_id, species$species_id)
surveys_genera <- species$genus[surveys_spid_index]
```
Now that we have our vector of genus names, we can add it as a new column to our
`surveys` object:
```{r, purl=TRUE}
surveys <- cbind(surveys, genus=surveys_genera)
```
### Challenge
* Use the same approach to also include the species names in the `surveys` data
frame.
```{r, echo=FALSE, purl=TRUE}
## Use the same approach to also include the species names in the
## `surveys` data frame.
```
```{r, echo=FALSE, purl=FALSE}
surveys_species <- species$species_id[surveys_spid_index]
surveys <- cbind(surveys, species_name=surveys_species)
```
* Use the help in R to understand what the function `paste()` does. Use it to
add a new column called `genus_species` into the `species` `data.frame`.
* Use the help to understand what the function `merge()` does. Use it to create
a new `data.frame` that combines the content of `surveys` and the modified
version of `species`.
* Use this data set to answer the following:
- How many birds have been captured?
- How many individuals of the genus *Dipodomys* have been captured?
# Adding rows
<!--- Even if this is not optimal, using this approach requires to cover less -->
<!--- material such as logical operations on vectors. Depending on how fast the -->
<!--- group moves, it might be better to show the correct way. -->
```{r, echo=FALSE, purl=TRUE}
## Adding rows
```
Let's create a `data.frame` that contains the information only for the species
"DO" and "DM". We know how to create the data set for each species with the
function `subset()`:
```{r, purl=FALSE}
surveys_DO <- subset(surveys, species == "DO")
surveys_DM <- subset(surveys, species == "DM")
```
Similarly to `cbind()` for columns, there is a function `rbind()` (__r__ow
__bind__) that puts together two `data.frame`. With `rbind()` the number of
columns and their names must be identical between the two objects:
```{r, purl=FALSE}
surveys_DO_DM <- rbind(surveys_DO, surveys_DM)
```
### Challenge
* Using a similar approach, construct a new `data.frame` that only includes data
for the years 2000 and 2001.
```{r, echo=FALSE, purl=TRUE}
## How many columns are now in (1) the `data.frame` `surveys`, (2) the `data.frame`
## `surveys_index`?
```
* How does it differ from `subset(surveys, species == "DO" | species == "DM")`?
# Removing columns
```{r, echo=FALSE, purl=FALSE}
## Removing columns
```
Just like you can select columns by their positions in the `data.frame` or by
their names, you can remove them similarly.
To remove it by column number:
```{r, results='show', purl=FALSE}
surveys_noDate <- surveys[, -c(3:5)]
colnames(surveys)
colnames(surveys_noDate)
```
The easiest way to remove by name is to use the `subset()` function. This time
we need to specify explicitly the argument `select` as the default is to subset
on rows (as above). The minus sign indicates the names of the columns to remove
(note that the column names should not be quoted):
```{r, results='show', purl=FALSE}
surveys_noDate2 <- subset(surveys, select=-c(month, day, year))
colnames(surveys_noDate2)
```