UQ Library 2024-09-17
<script src="time_series_files/libs/htmlwidgets-1.6.4/htmlwidgets.js"></script> <script src="time_series_files/libs/plotly-binding-4.10.4/plotly.js"></script> <script src="time_series_files/libs/setprototypeof-0.1/setprototypeof.js"></script> <script src="time_series_files/libs/typedarray-0.1/typedarray.min.js"></script> <script src="time_series_files/libs/jquery-3.5.1/jquery.min.js"></script> <script src="time_series_files/libs/crosstalk-1.2.1/js/crosstalk.min.js"></script> <script src="time_series_files/libs/plotly-main-2.11.1/plotly-latest.min.js"></script>- Essential shortcuts
- Open RStudio
- Disclaimer
- What are we going to learn?
- Material
- About the data
- Part 1: working with time series data
- Data cleaning
- Visualisation with ggplot2
- Summarise the data
- Export summarised data
- Interactive visualisation
- Rolling operations
- Part 2: time series analysis
- Models for time series data
- Resources
- function or dataset help: press F1 with your cursor anywhere in a function name.
- execute from script: Ctrl + Enter
- assignment operator (
<-
): Alt + -
On library computers:
- Log in with your UQ username and password (use your student credentials if you are both staff and student)
- Make sure you have a working internet connection
- Go to search the magnifying glass (bottom left)
- Open the ZENworks application
- Look for the letter R
- Double click on RStudio which will install both R and RStudio
If you are using your own laptop:
- Make sure you have a working internet connection
- Open RStudio
We will assume basic knowledge of R and RStudio for this course
including installing and loading packages, reading in data, creating
objects in R, tranforming data frames and tibbles with dplyr
package.
This hands-on session is broken into two parts
In Part 1 you will:
- read in data from multiple excel sheets
- clean the data and extract information
- use different date/time data formats
- visualize time series data, “rolling” operations
And Part 2 will focus on:
- investigating aspects of a times series such as trends, seasonality, and stationarity
- assessing autocorrelation
- applying different models
Install tidyverse, readxl, lubridate, plotly, RcppRoll, zoo, forecast, tseries, and xts if you don’t already have them, with:
install.packages("tidyvserse")
- the data science ‘metapackage’install.packages("readxl")
- reading in data from excelinstall.packages("lubridate")
- to work with date and time datainstall.packages("plotly")
- interactive plotsinstall.packages("RcppRoll")
- rolling average packageinstall.packages("zoo")
- “Zeileis ordered observations” irregularly spaced time series using the zooinstall.packages("forecast")
- seasonality componentsinstall.packages("tseries")
- test for stationarityinstall.packages("xts")
- moving averages
Create a new project to keep everything nicely contained in one directory:
- Click the “Create a project” button (top left cube icon)
- Click “New Directory”
- Click “New Project” (“Empty project” if you have an older version of RStudio)
- In “Directory name”, type the name of your project, e.g. “ggplot2_intermediate”
- Select the folder where to locate your project:
e.g.
Documents/RProjects
, which you can create if it doesn’t exist yet. You can use your H drive at UQ to make sure you can find it again. - Click the “Create Project” button
Let’s also create a “data” and “images” folder to store exports:
dir.create("data")
dir.create("images")
Our dataset describes atmospheric samples of a compound which were collected each day during seven consecutive days for different month in the year. Some years and months had less samples due to technical issues.
Let’s download our dataset form the web:
download.file("https://github.com/uqlibrary/technology-training/blob/master/R/timeseries/data/analytes_data.xlsx"
destfile = "data/analytes_data.xlsx",
mode = 'wb')
The
mode = 'wb'
is binary and necessary fordownload.file()
to work on Windows OS.
We have an XLSX workbook that contains several sheets. The first one is only documenting what the data is about, whereas the two other ones contain the data we are interested in.
The package readxl is useful for importing data stored in XLS and XLSX files. For example, to have a look at a single sheet of data, we can do the following:
# load the package
library(readxl)
# only import the second sheet
analytes <- read_excel("data/analytes_data.xlsx",
sheet = 2)
We could also point to the correct sheet by using the sheet name instead
of its index. For that, the excel_sheets()
function is useful to find
the names:
# excel_sheets() shows the sheet names
excel_sheets("data/analytes_data.xlsx")
[1] "infromation data " "Site_759" "Site_1335"
analytes <- read_excel("data/analytes_data.xlsx", sheet = "Site_759")
Let’s have a look at the first few rows of data:
head(analytes)
# A tibble: 6 × 4
`Site code` Analyte `Real date` `mg/day`
<dbl> <chr> <dttm> <dbl>
1 759 Compound x 1991-11-29 00:00:00 0.334
2 759 Compound x 1991-11-30 00:00:00 0.231
3 759 Compound x 1991-12-01 00:00:00 0.216
4 759 Compound x 1991-12-02 00:00:00 0.219
5 759 Compound x 1991-12-03 00:00:00 0.203
6 759 Compound x 1991-12-04 00:00:00 0.206
Even though this workbook only has two sheets of data, we might want to automate the reading and binding of all data sheets to avoid repeating code. This comes in very handy if you have a workbook with a dozen sheets of data, or if your data is split between several files.
The Tidyverse’s purrr package allows “mapping” a function (or a more complex command) to several elements. Here, we will map the reading of the sheet to each element in a vector of sheet names.
Using the map_dfr()
function makes sure we have a single data frame as
an output.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# only keep sheet names that contain actual data
sheets <- excel_sheets("data/analytes_data.xlsx")[2:3]
# map the reading to each sheet
analytes <- map_dfr(sheets,
~ read_excel("data/analytes_data.xlsx", sheet = .x))
We could map a function by simply providing the name of the function.
However, because we are doing something slightly more elaborate here
(pointing to one single file, and using an extra argument to point to
the sheet itself), we need to use the ~
syntax, and point to the
element being processed with the .x
placeholder.
For more information on the different options the
map
family offers, see?map
.
There are a few issues with the dataset. First of all, there are variations in how the compound is named. We can replace the value in the first column with a simpler, consistent one:
# all same compound
analytes$Analyte <- "x"
Our column names are not the most reusable names for R. Better names do
not contain spaces or special characters like /
. dplyr’s rename()
function is very handy for that:
library(dplyr)
analytes <- rename(analytes, Site = 1, Date = 3, mg_per_day = 4)
Finally, the Site column is stored as numeric data. If we plot it as it
is, R will consider it to be a continuous variable, when it really
should be discrete. Let’s fix that with dplyr’s mutate()
function:
analytes <- mutate(analytes, Site = as.character(Site))
We could convert it to a factor instead, but the Tidyverse packages tend to be happy with categorical data stored as the character type.
We now have a clean dataset in a single table, which we could make a copy of, especially to share with others, or if we want to split our code into several scripts that can work independently.
write.csv(analytes, "data/analytes_data_clean.csv",
row.names = FALSE)
write.csv()
will by default include a column of row names in the exported file, which are the row numbers if no row names have been assigned. That’s not usually something we want, so we can turn it off withrow.names = FALSE
At this stage, we can start exploring visually. For a lot of R users, the go-to package for data visualisation is ggplot2, which is also part of the Tidyverse.
For a ggplot2 visualisations, remember that we usually need these three essential elements:
- the dataset
- the mapping of aesthetic elements to variables in the dataset
- the geometry used to represent the data
Let’s try a first timeline visualisation with a line plot:
library(ggplot2)
ggplot(analytes, # data
aes(x = Date, # mapping of aesthetics
y = mg_per_day,
colour = Site)) + # (separate by site)
geom_line() # geometry
A simple line plot is not great here, because of the periodicity: there were bursts of sampling, several days in a row, and then nothing for a while. Which results in a fine, daily resolution for small periods of time, and a straight line joining these periods of time.
We might want to “smoothen” that line, hoping to get a better idea of the trend, keeping the original data as points in the background:
ggplot(analytes, aes(x = Date, y = mg_per_day, colour = Site)) +
geom_point() +
geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The trend lines only give a very general trend. What if we make it follow the points more closely?
ggplot(analytes, aes(x = Date, y = mg_per_day, colour = Site)) +
geom_point(size = 0.3) + # smaller points
geom_smooth(span = 0.05) # follow the data more closely
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
With the method used, we end up with an increased uncertainty (the shaded area around the curves). It also creates artificial “dips” to fit the data, for example close to the beginning of 2000 for the site 1335 (it even reaches negative values).
In this case, because we have sampling points for what looks like groups of successive days, we can try to summarise them.
Operations on time-date data can be done more comfortably with extra packages. The Tidyverse comes with the lubridate package, which has been around for a while and is very powerful. Another, more recent package called “clock” can do most of what lubridate can, and more, but it is still being heavily developed, so we stick to lubridate here.
Let’s start by extracting all the date components that could be useful:
library(lubridate)
analytes <- analytes %>%
mutate(year = year(Date),
month = month(Date),
day = day(Date),
week = week(Date),
weekday = weekdays(Date))
How many sampling days per month are there?
analytes %>%
group_by(Site, year, month) %>%
count() %>%
head(12)
# A tibble: 12 × 4
# Groups: Site, year, month [12]
Site year month n
<chr> <dbl> <dbl> <int>
1 1335 1991 11 3
2 1335 1991 12 3
3 1335 1992 1 2
4 1335 1992 2 5
5 1335 1992 3 5
6 1335 1992 4 2
7 1335 1992 5 4
8 1335 1992 6 3
9 1335 1992 7 2
10 1335 1992 8 4
11 1335 1992 10 7
12 1335 1992 12 6
The number of samples per month is irregular, and some months have no data.
Furthermore, the week numbers don’t align with the sampling weeks, and some sampling weeks overlap over two months:
analytes %>% select(year, month, day, week) %>% head(10)
# A tibble: 10 × 4
year month day week
<dbl> <dbl> <int> <dbl>
1 1991 11 29 48
2 1991 11 30 48
3 1991 12 1 48
4 1991 12 2 48
5 1991 12 3 49
6 1991 12 4 49
7 1991 12 5 49
8 1992 1 31 5
9 1992 2 1 5
10 1992 2 2 5
In any case, the fact that week numbers are reset at the beginning of the year wouldn’t help.
One way to group the sampling days together is to detect which ones are
spaced by one day, and which ones by a lot more. The lag()
and
lead()
functions from dplyr are very useful to compare values in a
single column:
analytes <- analytes %>%
arrange(Site, Date) %>% # make sure it is in chronological order
group_by(Site) %>% # deal with sites separately
mutate(days_lapsed = as.integer(Date - lag(Date))) %>% # compare date to previous date
ungroup() # leftover grouping might have unintended consequences later on
Grouping by site is important, otherwise we get an erroneous value at the row after switching to the second site. Because we grouped, it does not compare to the previous value in the different site, but instead only returns an
NA
.
How consistent are the sampling periods? Let’s investigate:
analytes %>%
count(days_lapsed) %>%
head()
# A tibble: 6 × 2
days_lapsed n
<int> <int>
1 1 608
2 2 4
3 3 3
4 39 1
5 42 1
6 43 5
It looks like some sampling days might have been missed, so we can define a sampling period as “a period in which sampling times are not spaced by more than 3 days”.
To create a grouping index, we can first assign a value of TRUE
to the
first row of each time period, and then use the cumulative sum function
on that column (as it converts TRUE
s to 1s and FALSE
s to 0s):
analytes <- analytes %>%
group_by(Site) %>%
mutate(sampling_period = row_number() == 1 | days_lapsed > 3,
sampling_period = cumsum(sampling_period)) %>%
ungroup()
We can now use these new group indices to summarise by time period:
analytes_summary <- analytes %>%
group_by(Analyte, Site, sampling_period) %>% # we are keeping Analyte
summarise(Date = round_date(mean(Date), unit = "day"),
mg_per_day = mean(mg_per_day)) %>%
ungroup()
`summarise()` has grouped output by 'Analyte', 'Site'. You can override using
the `.groups` argument.
We chose to average and round the date for each sampling period, but we could opt for another option depending on what we want to do, for example keeping the actual time interval:
interval(start = min(Date), end = max(Date))
Let’s try again our line plot with the summarised data:
ggplot(analytes_summary,
aes(x = Date,
y = mg_per_day,
colour = Site)) +
geom_line()
This is a lot cleaner than what we had originally!
We have previously exported a CSV, which is a great, simple format that can be opened pretty much anywhere. However, if you want to save an R object to reopen it exactly as it was, you can use an R-specific format like RData.
save(analytes_summary, file = "data/summary.RData")
The file can then be imported again with the load()
function. You
won’t need to convert the columns to the correct data type again.
Exploring timelines might be more comfortable with an interactive visualisation. Plotly is a helpful library available for various programming languages, and the plotly package makes it easy to use it in R.
Once a visualisation is created in R, it is trivial to convert it to a
Plotly visualisation with one single function: ggplotly()
.
# save as an object
p <- ggplot(analytes_summary,
aes(x = Date,
y = mg_per_day,
colour = Site)) +
geom_line()
# turn it into a plotly visualisation
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
ggplotly(p)
mg_per_day: 0.21260097
Site: 1335","Date: 1992-02-02
mg_per_day: 0.26409823
Site: 1335","Date: 1992-03-30
mg_per_day: 0.19269962
Site: 1335","Date: 1992-05-31
mg_per_day: 0.21574949
Site: 1335","Date: 1992-08-02
mg_per_day: 0.26905370
Site: 1335","Date: 1992-10-05
mg_per_day: 0.18834451
Site: 1335","Date: 1992-12-06
mg_per_day: 0.09494631
Site: 1335","Date: 1993-02-07
mg_per_day: 0.16351067
Site: 1335","Date: 1993-04-13
mg_per_day: 0.20807004
Site: 1335","Date: 1993-06-06
mg_per_day: 0.24936087
Site: 1335","Date: 1993-08-08
mg_per_day: 0.26921357
Site: 1335","Date: 1993-10-03
mg_per_day: 0.26448656
Site: 1335","Date: 1993-12-05
mg_per_day: 0.37014955
Site: 1335","Date: 1994-02-06
mg_per_day: 0.33177930
Site: 1335","Date: 1994-04-03
mg_per_day: 0.32827654
Site: 1335","Date: 1994-06-05
mg_per_day: 0.32467068
Site: 1335","Date: 1994-08-05
mg_per_day: 0.26865238
Site: 1335","Date: 1994-10-06
mg_per_day: 0.31360895
Site: 1335","Date: 1994-12-05
mg_per_day: 0.26773691
Site: 1335","Date: 1995-02-05
mg_per_day: 0.34481141
Site: 1335","Date: 1995-04-16
mg_per_day: 0.34235679
Site: 1335","Date: 1995-06-03
mg_per_day: 0.45197699
Site: 1335","Date: 1995-08-06
mg_per_day: 0.44669432
Site: 1335","Date: 1995-10-15
mg_per_day: 0.37206745
Site: 1335","Date: 1995-12-03
mg_per_day: 0.46843950
Site: 1335","Date: 1996-02-04
mg_per_day: 0.33436319
Site: 1335","Date: 1996-04-06
mg_per_day: 0.38241556
Site: 1335","Date: 1996-06-09
mg_per_day: 0.38570498
Site: 1335","Date: 1996-08-04
mg_per_day: 0.46601276
Site: 1335","Date: 1996-10-12
mg_per_day: 0.65457749
Site: 1335","Date: 1996-11-30
mg_per_day: 0.66571688
Site: 1335","Date: 1997-02-08
mg_per_day: 0.57168051
Site: 1335","Date: 1997-04-05
mg_per_day: 0.40724687
Site: 1335","Date: 1997-06-07
mg_per_day: 0.41195379
Site: 1335","Date: 1997-08-05
mg_per_day: 0.50863024
Site: 1335","Date: 1997-10-07
mg_per_day: 0.60921090
Site: 1335","Date: 1997-12-02
mg_per_day: 0.67117759
Site: 1335","Date: 1998-02-10
mg_per_day: 0.84627446
Site: 1335","Date: 1998-04-07
mg_per_day: 0.26116715
Site: 1335","Date: 1998-06-02
mg_per_day: 0.25104042
Site: 1335","Date: 1998-08-13
mg_per_day: 0.32420921
Site: 1335","Date: 1998-12-01
mg_per_day: 0.45786933
Site: 1335","Date: 1999-02-02
mg_per_day: 0.35710367
Site: 1335","Date: 1999-04-06
mg_per_day: 0.32072347
Site: 1335","Date: 1999-06-02
mg_per_day: 0.27374589
Site: 1335","Date: 1999-08-08
mg_per_day: 0.35837799
Site: 1335","Date: 2000-02-01
mg_per_day: 0.46640692
Site: 1335","Date: 2000-04-04
mg_per_day: 0.48207736
Site: 1335","Date: 2000-05-30
mg_per_day: 0.26009646
Site: 1335","Date: 2000-08-12
mg_per_day: 0.24478730
Site: 1335","Date: 2000-09-26
mg_per_day: 0.18937818
Site: 1335"],"type":"scatter","mode":"lines","line":{"width":1.8897637795275593,"color":"rgba(248,118,109,1)","dash":"solid"},"hoveron":"points","name":"1335","legendgroup":"1335","showlegend":true,"xaxis":"x","yaxis":"y","hoverinfo":"text","frame":null},{"x":[691632000,697075200,701827200,707270400,712713600,718329600,723600000,729043200,734486400,739324800,744768000,749606400,755049600,760492800,765244800,770774400,776044800,781315200,786585600,791942400,798076800,802224000,807667200,813715200,817948800,823392000,828748800,834278400,839635200,845078400,849312000,855360000,860198400,865641600,870652800,876182400,881020800,887068800,891907200,896745600,903225600,907632000,912470400,917913600,923356800,928281600,933984000,938476800,943920000,949363200,954806400,959644800,965174400,969926400],"y":[0.24064323432840157,0.23986656500271758,0.21110339781145429,0.26016235384033187,0.30429624767079344,0.26926220750427887,0.15885272737831099,0.19486603385709572,0.23692947742810214,0.2786177427809447,0.34465690058015913,0.33866378196418873,0.34314905763436382,0.40192729263915183,0.37845777933374497,0.47651867351853761,0.37661075434769431,0.39294226440192498,0.34969865706769587,0.42331136112791046,0.47129973095449534,0.42790866945220418,0.49896330652979204,0.46822721428249331,0.6014366772813734,0.49164044871490958,0.54571289864749062,0.52418866292004318,0.575373252340876,0.76791108278992048,0.70437956033559845,0.57703510732898544,0.50761530367116636,0.47435185010390057,0.84009225514584696,1.1488959510296419,0.95047474919217156,0.97251163049292855,0.47515180845895427,0.46321373234317431,0.59141097499534911,0.64898080733711705,0.72866742705371745,0.61270664975639,0.52502906712947817,0.40778707362508099,0.48358695075050356,0.54014857905344404,0.60626335470792903,0.61466798317316906,0.60319125831305798,0.55818962004978501,0.33204840411515746,0.25951184696501528],"text":["Date: 1991-12-02
mg_per_day: 0.24064323
Site: 759","Date: 1992-02-03
mg_per_day: 0.23986657
Site: 759","Date: 1992-03-29
mg_per_day: 0.21110340
Site: 759","Date: 1992-05-31
mg_per_day: 0.26016235
Site: 759","Date: 1992-08-02
mg_per_day: 0.30429625
Site: 759","Date: 1992-10-06
mg_per_day: 0.26926221
Site: 759","Date: 1992-12-06
mg_per_day: 0.15885273
Site: 759","Date: 1993-02-07
mg_per_day: 0.19486603
Site: 759","Date: 1993-04-11
mg_per_day: 0.23692948
Site: 759","Date: 1993-06-06
mg_per_day: 0.27861774
Site: 759","Date: 1993-08-08
mg_per_day: 0.34465690
Site: 759","Date: 1993-10-03
mg_per_day: 0.33866378
Site: 759","Date: 1993-12-05
mg_per_day: 0.34314906
Site: 759","Date: 1994-02-06
mg_per_day: 0.40192729
Site: 759","Date: 1994-04-02
mg_per_day: 0.37845778
Site: 759","Date: 1994-06-05
mg_per_day: 0.47651867
Site: 759","Date: 1994-08-05
mg_per_day: 0.37661075
Site: 759","Date: 1994-10-05
mg_per_day: 0.39294226
Site: 759","Date: 1994-12-05
mg_per_day: 0.34969866
Site: 759","Date: 1995-02-05
mg_per_day: 0.42331136
Site: 759","Date: 1995-04-17
mg_per_day: 0.47129973
Site: 759","Date: 1995-06-04
mg_per_day: 0.42790867
Site: 759","Date: 1995-08-06
mg_per_day: 0.49896331
Site: 759","Date: 1995-10-15
mg_per_day: 0.46822721
Site: 759","Date: 1995-12-03
mg_per_day: 0.60143668
Site: 759","Date: 1996-02-04
mg_per_day: 0.49164045
Site: 759","Date: 1996-04-06
mg_per_day: 0.54571290
Site: 759","Date: 1996-06-09
mg_per_day: 0.52418866
Site: 759","Date: 1996-08-10
mg_per_day: 0.57537325
Site: 759","Date: 1996-10-12
mg_per_day: 0.76791108
Site: 759","Date: 1996-11-30
mg_per_day: 0.70437956
Site: 759","Date: 1997-02-08
mg_per_day: 0.57703511
Site: 759","Date: 1997-04-05
mg_per_day: 0.50761530
Site: 759","Date: 1997-06-07
mg_per_day: 0.47435185
Site: 759","Date: 1997-08-04
mg_per_day: 0.84009226
Site: 759","Date: 1997-10-07
mg_per_day: 1.14889595
Site: 759","Date: 1997-12-02
mg_per_day: 0.95047475
Site: 759","Date: 1998-02-10
mg_per_day: 0.97251163
Site: 759","Date: 1998-04-07
mg_per_day: 0.47515181
Site: 759","Date: 1998-06-02
mg_per_day: 0.46321373
Site: 759","Date: 1998-08-16
mg_per_day: 0.59141097
Site: 759","Date: 1998-10-06
mg_per_day: 0.64898081
Site: 759","Date: 1998-12-01
mg_per_day: 0.72866743
Site: 759","Date: 1999-02-02
mg_per_day: 0.61270665
Site: 759","Date: 1999-04-06
mg_per_day: 0.52502907
Site: 759","Date: 1999-06-02
mg_per_day: 0.40778707
Site: 759","Date: 1999-08-07
mg_per_day: 0.48358695
Site: 759","Date: 1999-09-28
mg_per_day: 0.54014858
Site: 759","Date: 1999-11-30
mg_per_day: 0.60626335
Site: 759","Date: 2000-02-01
mg_per_day: 0.61466798
Site: 759","Date: 2000-04-04
mg_per_day: 0.60319126
Site: 759","Date: 2000-05-30
mg_per_day: 0.55818962
Site: 759","Date: 2000-08-02
mg_per_day: 0.33204840
Site: 759","Date: 2000-09-26
mg_per_day: 0.25951185
Site: 759"],"type":"scatter","mode":"lines","line":{"width":1.8897637795275593,"color":"rgba(0,191,196,1)","dash":"solid"},"hoveron":"points","name":"759","legendgroup":"759","showlegend":true,"xaxis":"x","yaxis":"y","hoverinfo":"text","frame":null}],"layout":{"margin":{"t":26.228310502283104,"r":7.3059360730593621,"b":40.182648401826491,"l":43.105022831050235},"plot_bgcolor":"rgba(235,235,235,1)","paper_bgcolor":"rgba(255,255,255,1)","font":{"color":"rgba(0,0,0,1)","family":"","size":14.611872146118724},"xaxis":{"domain":[0,1],"automargin":true,"type":"linear","autorange":false,"range":[677626560,983845440],"tickmode":"array","ticktext":["1992","1994","1996","1998","2000"],"tickvals":[694224000,757382400,820454400,883612800,946684800],"categoryorder":"array","categoryarray":["1992","1994","1996","1998","2000"],"nticks":null,"ticks":"outside","tickcolor":"rgba(51,51,51,1)","ticklen":3.6529680365296811,"tickwidth":0.66417600664176002,"showticklabels":true,"tickfont":{"color":"rgba(77,77,77,1)","family":"","size":11.68949771689498},"tickangle":-0,"showline":false,"linecolor":null,"linewidth":0,"showgrid":true,"gridcolor":"rgba(255,255,255,1)","gridwidth":0.66417600664176002,"zeroline":false,"anchor":"y","title":{"text":"Date","font":{"color":"rgba(0,0,0,1)","family":"","size":14.611872146118724}},"hoverformat":".2f"},"yaxis":{"domain":[0,1],"automargin":true,"type":"linear","autorange":false,"range":[0.042248824194696256,1.2015934332598774],"tickmode":"array","ticktext":["0.3","0.6","0.9","1.2"],"tickvals":[0.30000000000000004,0.60000000000000009,0.90000000000000024,1.2000000000000002],"categoryorder":"array","categoryarray":["0.3","0.6","0.9","1.2"],"nticks":null,"ticks":"outside","tickcolor":"rgba(51,51,51,1)","ticklen":3.6529680365296811,"tickwidth":0.66417600664176002,"showticklabels":true,"tickfont":{"color":"rgba(77,77,77,1)","family":"","size":11.68949771689498},"tickangle":-0,"showline":false,"linecolor":null,"linewidth":0,"showgrid":true,"gridcolor":"rgba(255,255,255,1)","gridwidth":0.66417600664176002,"zeroline":false,"anchor":"x","title":{"text":"mg_per_day","font":{"color":"rgba(0,0,0,1)","family":"","size":14.611872146118724}},"hoverformat":".2f"},"shapes":[{"type":"rect","fillcolor":null,"line":{"color":null,"width":0,"linetype":[]},"yref":"paper","xref":"paper","x0":0,"x1":1,"y0":0,"y1":1}],"showlegend":true,"legend":{"bgcolor":"rgba(255,255,255,1)","bordercolor":"transparent","borderwidth":1.8897637795275593,"font":{"color":"rgba(0,0,0,1)","family":"","size":11.68949771689498},"title":{"text":"Site","font":{"color":"rgba(0,0,0,1)","family":"","size":14.611872146118724}}},"hovermode":"closest","barmode":"relative"},"config":{"doubleClick":"reset","modeBarButtonsToAdd":["hoverclosest","hovercompare"],"showSendToCloud":false},"source":"A","attrs":{"66d85cc56a25":{"x":{},"y":{},"colour":{},"type":"scatter"}},"cur_data":"66d85cc56a25","visdat":{"66d85cc56a25":["function (y) ","x"]},"highlight":{"on":"plotly_click","persistent":false,"dynamic":false,"selectize":false,"opacityDim":0.20000000000000001,"selected":{"opacity":1},"debounce":0},"shinyEvents":["plotly_hover","plotly_click","plotly_selected","plotly_relayout","plotly_brushed","plotly_brushing","plotly_clickannotation","plotly_doubleclick","plotly_deselect","plotly_afterplot","plotly_sunburstclick"],"base_url":"https://plot.ly"},"evals":[],"jsHooks":[]}</script>
To focus on a section, draw a rectangle (and double-click to reset to the full view).
With several time series plotted, it is useful to change the hover setting to “compare data on hover” with this button:
It is however possible to set a similar hover mode as a default:
ggplotly(p) %>%
layout(hovermode = "x unified")
mg_per_day: 0.21260097
Site: 1335","Date: 1992-02-02
mg_per_day: 0.26409823
Site: 1335","Date: 1992-03-30
mg_per_day: 0.19269962
Site: 1335","Date: 1992-05-31
mg_per_day: 0.21574949
Site: 1335","Date: 1992-08-02
mg_per_day: 0.26905370
Site: 1335","Date: 1992-10-05
mg_per_day: 0.18834451
Site: 1335","Date: 1992-12-06
mg_per_day: 0.09494631
Site: 1335","Date: 1993-02-07
mg_per_day: 0.16351067
Site: 1335","Date: 1993-04-13
mg_per_day: 0.20807004
Site: 1335","Date: 1993-06-06
mg_per_day: 0.24936087
Site: 1335","Date: 1993-08-08
mg_per_day: 0.26921357
Site: 1335","Date: 1993-10-03
mg_per_day: 0.26448656
Site: 1335","Date: 1993-12-05
mg_per_day: 0.37014955
Site: 1335","Date: 1994-02-06
mg_per_day: 0.33177930
Site: 1335","Date: 1994-04-03
mg_per_day: 0.32827654
Site: 1335","Date: 1994-06-05
mg_per_day: 0.32467068
Site: 1335","Date: 1994-08-05
mg_per_day: 0.26865238
Site: 1335","Date: 1994-10-06
mg_per_day: 0.31360895
Site: 1335","Date: 1994-12-05
mg_per_day: 0.26773691
Site: 1335","Date: 1995-02-05
mg_per_day: 0.34481141
Site: 1335","Date: 1995-04-16
mg_per_day: 0.34235679
Site: 1335","Date: 1995-06-03
mg_per_day: 0.45197699
Site: 1335","Date: 1995-08-06
mg_per_day: 0.44669432
Site: 1335","Date: 1995-10-15
mg_per_day: 0.37206745
Site: 1335","Date: 1995-12-03
mg_per_day: 0.46843950
Site: 1335","Date: 1996-02-04
mg_per_day: 0.33436319
Site: 1335","Date: 1996-04-06
mg_per_day: 0.38241556
Site: 1335","Date: 1996-06-09
mg_per_day: 0.38570498
Site: 1335","Date: 1996-08-04
mg_per_day: 0.46601276
Site: 1335","Date: 1996-10-12
mg_per_day: 0.65457749
Site: 1335","Date: 1996-11-30
mg_per_day: 0.66571688
Site: 1335","Date: 1997-02-08
mg_per_day: 0.57168051
Site: 1335","Date: 1997-04-05
mg_per_day: 0.40724687
Site: 1335","Date: 1997-06-07
mg_per_day: 0.41195379
Site: 1335","Date: 1997-08-05
mg_per_day: 0.50863024
Site: 1335","Date: 1997-10-07
mg_per_day: 0.60921090
Site: 1335","Date: 1997-12-02
mg_per_day: 0.67117759
Site: 1335","Date: 1998-02-10
mg_per_day: 0.84627446
Site: 1335","Date: 1998-04-07
mg_per_day: 0.26116715
Site: 1335","Date: 1998-06-02
mg_per_day: 0.25104042
Site: 1335","Date: 1998-08-13
mg_per_day: 0.32420921
Site: 1335","Date: 1998-12-01
mg_per_day: 0.45786933
Site: 1335","Date: 1999-02-02
mg_per_day: 0.35710367
Site: 1335","Date: 1999-04-06
mg_per_day: 0.32072347
Site: 1335","Date: 1999-06-02
mg_per_day: 0.27374589
Site: 1335","Date: 1999-08-08
mg_per_day: 0.35837799
Site: 1335","Date: 2000-02-01
mg_per_day: 0.46640692
Site: 1335","Date: 2000-04-04
mg_per_day: 0.48207736
Site: 1335","Date: 2000-05-30
mg_per_day: 0.26009646
Site: 1335","Date: 2000-08-12
mg_per_day: 0.24478730
Site: 1335","Date: 2000-09-26
mg_per_day: 0.18937818
Site: 1335"],"type":"scatter","mode":"lines","line":{"width":1.8897637795275593,"color":"rgba(248,118,109,1)","dash":"solid"},"hoveron":"points","name":"1335","legendgroup":"1335","showlegend":true,"xaxis":"x","yaxis":"y","hoverinfo":"text","frame":null},{"x":[691632000,697075200,701827200,707270400,712713600,718329600,723600000,729043200,734486400,739324800,744768000,749606400,755049600,760492800,765244800,770774400,776044800,781315200,786585600,791942400,798076800,802224000,807667200,813715200,817948800,823392000,828748800,834278400,839635200,845078400,849312000,855360000,860198400,865641600,870652800,876182400,881020800,887068800,891907200,896745600,903225600,907632000,912470400,917913600,923356800,928281600,933984000,938476800,943920000,949363200,954806400,959644800,965174400,969926400],"y":[0.24064323432840157,0.23986656500271758,0.21110339781145429,0.26016235384033187,0.30429624767079344,0.26926220750427887,0.15885272737831099,0.19486603385709572,0.23692947742810214,0.2786177427809447,0.34465690058015913,0.33866378196418873,0.34314905763436382,0.40192729263915183,0.37845777933374497,0.47651867351853761,0.37661075434769431,0.39294226440192498,0.34969865706769587,0.42331136112791046,0.47129973095449534,0.42790866945220418,0.49896330652979204,0.46822721428249331,0.6014366772813734,0.49164044871490958,0.54571289864749062,0.52418866292004318,0.575373252340876,0.76791108278992048,0.70437956033559845,0.57703510732898544,0.50761530367116636,0.47435185010390057,0.84009225514584696,1.1488959510296419,0.95047474919217156,0.97251163049292855,0.47515180845895427,0.46321373234317431,0.59141097499534911,0.64898080733711705,0.72866742705371745,0.61270664975639,0.52502906712947817,0.40778707362508099,0.48358695075050356,0.54014857905344404,0.60626335470792903,0.61466798317316906,0.60319125831305798,0.55818962004978501,0.33204840411515746,0.25951184696501528],"text":["Date: 1991-12-02
mg_per_day: 0.24064323
Site: 759","Date: 1992-02-03
mg_per_day: 0.23986657
Site: 759","Date: 1992-03-29
mg_per_day: 0.21110340
Site: 759","Date: 1992-05-31
mg_per_day: 0.26016235
Site: 759","Date: 1992-08-02
mg_per_day: 0.30429625
Site: 759","Date: 1992-10-06
mg_per_day: 0.26926221
Site: 759","Date: 1992-12-06
mg_per_day: 0.15885273
Site: 759","Date: 1993-02-07
mg_per_day: 0.19486603
Site: 759","Date: 1993-04-11
mg_per_day: 0.23692948
Site: 759","Date: 1993-06-06
mg_per_day: 0.27861774
Site: 759","Date: 1993-08-08
mg_per_day: 0.34465690
Site: 759","Date: 1993-10-03
mg_per_day: 0.33866378
Site: 759","Date: 1993-12-05
mg_per_day: 0.34314906
Site: 759","Date: 1994-02-06
mg_per_day: 0.40192729
Site: 759","Date: 1994-04-02
mg_per_day: 0.37845778
Site: 759","Date: 1994-06-05
mg_per_day: 0.47651867
Site: 759","Date: 1994-08-05
mg_per_day: 0.37661075
Site: 759","Date: 1994-10-05
mg_per_day: 0.39294226
Site: 759","Date: 1994-12-05
mg_per_day: 0.34969866
Site: 759","Date: 1995-02-05
mg_per_day: 0.42331136
Site: 759","Date: 1995-04-17
mg_per_day: 0.47129973
Site: 759","Date: 1995-06-04
mg_per_day: 0.42790867
Site: 759","Date: 1995-08-06
mg_per_day: 0.49896331
Site: 759","Date: 1995-10-15
mg_per_day: 0.46822721
Site: 759","Date: 1995-12-03
mg_per_day: 0.60143668
Site: 759","Date: 1996-02-04
mg_per_day: 0.49164045
Site: 759","Date: 1996-04-06
mg_per_day: 0.54571290
Site: 759","Date: 1996-06-09
mg_per_day: 0.52418866
Site: 759","Date: 1996-08-10
mg_per_day: 0.57537325
Site: 759","Date: 1996-10-12
mg_per_day: 0.76791108
Site: 759","Date: 1996-11-30
mg_per_day: 0.70437956
Site: 759","Date: 1997-02-08
mg_per_day: 0.57703511
Site: 759","Date: 1997-04-05
mg_per_day: 0.50761530
Site: 759","Date: 1997-06-07
mg_per_day: 0.47435185
Site: 759","Date: 1997-08-04
mg_per_day: 0.84009226
Site: 759","Date: 1997-10-07
mg_per_day: 1.14889595
Site: 759","Date: 1997-12-02
mg_per_day: 0.95047475
Site: 759","Date: 1998-02-10
mg_per_day: 0.97251163
Site: 759","Date: 1998-04-07
mg_per_day: 0.47515181
Site: 759","Date: 1998-06-02
mg_per_day: 0.46321373
Site: 759","Date: 1998-08-16
mg_per_day: 0.59141097
Site: 759","Date: 1998-10-06
mg_per_day: 0.64898081
Site: 759","Date: 1998-12-01
mg_per_day: 0.72866743
Site: 759","Date: 1999-02-02
mg_per_day: 0.61270665
Site: 759","Date: 1999-04-06
mg_per_day: 0.52502907
Site: 759","Date: 1999-06-02
mg_per_day: 0.40778707
Site: 759","Date: 1999-08-07
mg_per_day: 0.48358695
Site: 759","Date: 1999-09-28
mg_per_day: 0.54014858
Site: 759","Date: 1999-11-30
mg_per_day: 0.60626335
Site: 759","Date: 2000-02-01
mg_per_day: 0.61466798
Site: 759","Date: 2000-04-04
mg_per_day: 0.60319126
Site: 759","Date: 2000-05-30
mg_per_day: 0.55818962
Site: 759","Date: 2000-08-02
mg_per_day: 0.33204840
Site: 759","Date: 2000-09-26
mg_per_day: 0.25951185
Site: 759"],"type":"scatter","mode":"lines","line":{"width":1.8897637795275593,"color":"rgba(0,191,196,1)","dash":"solid"},"hoveron":"points","name":"759","legendgroup":"759","showlegend":true,"xaxis":"x","yaxis":"y","hoverinfo":"text","frame":null}],"layout":{"margin":{"t":26.228310502283104,"r":7.3059360730593621,"b":40.182648401826491,"l":43.105022831050235},"plot_bgcolor":"rgba(235,235,235,1)","paper_bgcolor":"rgba(255,255,255,1)","font":{"color":"rgba(0,0,0,1)","family":"","size":14.611872146118724},"xaxis":{"domain":[0,1],"automargin":true,"type":"linear","autorange":false,"range":[677626560,983845440],"tickmode":"array","ticktext":["1992","1994","1996","1998","2000"],"tickvals":[694224000,757382400,820454400,883612800,946684800],"categoryorder":"array","categoryarray":["1992","1994","1996","1998","2000"],"nticks":null,"ticks":"outside","tickcolor":"rgba(51,51,51,1)","ticklen":3.6529680365296811,"tickwidth":0.66417600664176002,"showticklabels":true,"tickfont":{"color":"rgba(77,77,77,1)","family":"","size":11.68949771689498},"tickangle":-0,"showline":false,"linecolor":null,"linewidth":0,"showgrid":true,"gridcolor":"rgba(255,255,255,1)","gridwidth":0.66417600664176002,"zeroline":false,"anchor":"y","title":{"text":"Date","font":{"color":"rgba(0,0,0,1)","family":"","size":14.611872146118724}},"hoverformat":".2f"},"yaxis":{"domain":[0,1],"automargin":true,"type":"linear","autorange":false,"range":[0.042248824194696256,1.2015934332598774],"tickmode":"array","ticktext":["0.3","0.6","0.9","1.2"],"tickvals":[0.30000000000000004,0.60000000000000009,0.90000000000000024,1.2000000000000002],"categoryorder":"array","categoryarray":["0.3","0.6","0.9","1.2"],"nticks":null,"ticks":"outside","tickcolor":"rgba(51,51,51,1)","ticklen":3.6529680365296811,"tickwidth":0.66417600664176002,"showticklabels":true,"tickfont":{"color":"rgba(77,77,77,1)","family":"","size":11.68949771689498},"tickangle":-0,"showline":false,"linecolor":null,"linewidth":0,"showgrid":true,"gridcolor":"rgba(255,255,255,1)","gridwidth":0.66417600664176002,"zeroline":false,"anchor":"x","title":{"text":"mg_per_day","font":{"color":"rgba(0,0,0,1)","family":"","size":14.611872146118724}},"hoverformat":".2f"},"shapes":[{"type":"rect","fillcolor":null,"line":{"color":null,"width":0,"linetype":[]},"yref":"paper","xref":"paper","x0":0,"x1":1,"y0":0,"y1":1}],"showlegend":true,"legend":{"bgcolor":"rgba(255,255,255,1)","bordercolor":"transparent","borderwidth":1.8897637795275593,"font":{"color":"rgba(0,0,0,1)","family":"","size":11.68949771689498},"title":{"text":"Site","font":{"color":"rgba(0,0,0,1)","family":"","size":14.611872146118724}}},"hovermode":"x unified","barmode":"relative"},"config":{"doubleClick":"reset","modeBarButtonsToAdd":["hoverclosest","hovercompare"],"showSendToCloud":false},"source":"A","attrs":{"66d828113376":{"x":{},"y":{},"colour":{},"type":"scatter"}},"cur_data":"66d828113376","visdat":{"66d828113376":["function (y) ","x"]},"highlight":{"on":"plotly_click","persistent":false,"dynamic":false,"selectize":false,"opacityDim":0.20000000000000001,"selected":{"opacity":1},"debounce":0},"shinyEvents":["plotly_hover","plotly_click","plotly_selected","plotly_relayout","plotly_brushed","plotly_brushing","plotly_clickannotation","plotly_doubleclick","plotly_deselect","plotly_afterplot","plotly_sunburstclick"],"base_url":"https://plot.ly"},"evals":[],"jsHooks":[]}</script>
With time series affected by seasonality, or with a lot of variation creating visual noise, it is sometimes useful to represent long-term trends with a rolling average.
The package RcppRoll is useful for this kind of windowed operation:
library(RcppRoll)
analytes_summary %>%
group_by(Site) %>%
mutate(rolling_mean = roll_mean(mg_per_day,
n = 6, # the size of the window
fill = NA)) %>% # to have same length
ggplot(aes(x = Date,
y = rolling_mean,
colour = Site)) +
geom_line()
Warning: Removed 10 rows containing missing values or values outside the scale range
(`geom_line()`).
This method might not be the best for this data, but it proves very useful in other cases, for example for COVID-19 daily infection rates (with a 7-day window).
library(lubridate) # work with date/time data
library(zoo)
library(forecast)
library(tseries) # test for stationarity
For this section, we will go the process of analysing time series for one site.
Let’s read in the cleaned data from Part 1, filter out the same site (1335), and split the Date column with lubridate. This can be done in one go with piping.
s1335 <- read_csv("data/analytes_data_clean.csv") %>%
filter(Site == "1335") %>%
mutate(Year = year(Date),
Month = month(Date),
Day = day(Date),
Site = factor(Site)) # change Site to a factor
Rows: 720 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Analyte
dbl (2): Site, mg_per_day
date (1): Date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
s1335
# A tibble: 352 × 7
Site Analyte Date mg_per_day Year Month Day
<fct> <chr> <date> <dbl> <dbl> <dbl> <int>
1 1335 x 1991-11-28 0.253 1991 11 28
2 1335 x 1991-11-29 0.239 1991 11 29
3 1335 x 1991-11-30 0.197 1991 11 30
4 1335 x 1991-12-01 0.173 1991 12 1
5 1335 x 1991-12-02 0.222 1991 12 2
6 1335 x 1991-12-03 0.191 1991 12 3
7 1335 x 1992-01-30 0.298 1992 1 30
8 1335 x 1992-01-31 0.253 1992 1 31
9 1335 x 1992-02-01 0.256 1992 2 1
10 1335 x 1992-02-02 0.284 1992 2 2
# ℹ 342 more rows
ggplot(s1335,
aes(Date, mg_per_day, color = Site)) +
geom_point()
Count the number of samples per month.
s1335 %>%
group_by(Year, Month) %>%
summarize(count = n()) %>%
arrange(-count) # arrange by count column in descending order
`summarise()` has grouped output by 'Year'. You can override using the
`.groups` argument.
# A tibble: 68 × 3
# Groups: Year [10]
Year Month count
<dbl> <dbl> <int>
1 1992 10 7
2 1993 2 7
3 1993 4 7
4 1993 6 7
5 1993 8 7
6 1993 12 7
7 1994 2 7
8 1994 6 7
9 1994 8 7
10 1994 12 7
# ℹ 58 more rows
The maximum number of samples we have per month is 7. Probably not enough to do any meaningful analysis for a daily trend. Let’s average samples by month. There also can be no data gaps for a time series (ts) data class.
ave_s1335 <- s1335 %>%
group_by(Year, Month, Site) %>%
summarize(mg_per_day = mean(mg_per_day),
SD = sd(mg_per_day))
`summarise()` has grouped output by 'Year', 'Month'. You can override using the
`.groups` argument.
Alternatively, can graphically summarize the distribution of dates using
the hist()
(hist.Date()
) function.
hist(as.Date(s1335$Date), # change POSIXct to Date object
breaks = "months",
freq = TRUE,
xlab = "", # remove label so doesn't overlap with date labels,
format = "%b %Y", # format the date label, mon year
las = 2)
Let’s convert our data into a time series. Times series data must be sampled at equispaced points in time.
There are several different time series object that have different functionalities such as working with irregularly spaced time series. See this resource.
ts_1335 <- ts(ave_s1335$mg_per_day, frequency = 12,
start = c(1991, 11), end = c(2000, 9))
class(ts_1335) # check the object class
[1] "ts"
ts_1335 # see the data
Jan Feb Mar Apr May Jun
1991
1992 0.27545325 0.25955623 0.17283015 0.24237330 0.21492034 0.21685501
1993 0.24936087 0.26921357 0.30978222 0.25693728 0.37014955 0.33177930
1994 0.34481141 0.34235679 0.36426187 0.46659618 0.44669432 0.37206745
1995 0.65457749 0.69834479 0.62221301 0.57168051 0.40724687 0.41195379
1996 0.24812201 0.25220778 0.32420921 0.40677658 0.49618890 0.34124538
1997 0.49926914 0.48207736 0.26538307 0.24687993 0.24478730 0.18937818
1998 0.21492034 0.21685501 0.21884272 0.29415920 0.18834451 0.09494631
1999 0.37014955 0.33177930 0.27831867 0.33660285 0.32467068 0.26865238
2000 0.44669432 0.37206745 0.39990788 0.47986144 0.33436319 0.38241556
Jul Aug Sep Oct Nov Dec
1991 0.22975903 0.19544291
1992 0.21884272 0.29415920 0.18834451 0.09494631 0.16351067 0.20807004
1993 0.27831867 0.33660285 0.32467068 0.26865238 0.31360895 0.26773691
1994 0.39990788 0.47986144 0.33436319 0.38241556 0.38570498 0.46601276
1995 0.50863024 0.60921090 0.66656370 0.67302314 0.84627446 0.26116715
1996 0.36344698 0.32072347 0.29003143 0.26723167 0.35837799 0.42259062
1997 0.22975903 0.19544291 0.27545325 0.25955623 0.17283015 0.24237330
1998 0.16351067 0.20807004 0.24936087 0.26921357 0.30978222 0.25693728
1999 0.31360895 0.26773691 0.34481141 0.34235679 0.36426187 0.46659618
2000 0.38570498 0.46601276 0.65457749
The zoo
class is a flexible time series data with an ordered time
index. The data is stored in a matrix with vector date information
attached. Can be regularly or irregularly spaced.
document.
library(zoo)
z_1335 <- zoo(s1335$mg_per_day, order.by = s1335$Date)
head(z_1335)
1991-11-28 1991-11-29 1991-11-30 1991-12-01 1991-12-02 1991-12-03
0.2530688 0.2390078 0.1972005 0.1728767 0.2224569 0.1909951
Decomposition separates out a times series
These elements can be additive when the seasonal component is relatively constant over time.
Or multiplicative when seasonal effects tend to increase as the trend increases.
The decompose()
function uses a moving average (MA) approach to filter
the data. The window or period over which you after is based on the
frequency of the data. For example, monthly data can be averaged across
a 12 month period. Original code from Time Series Analysis with R Ch.
7.2.1.
library(xts)
######################### Warning from 'xts' package ##########################
# #
# The dplyr lag() function breaks how base R's lag() function is supposed to #
# work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or #
# source() into this session won't work correctly. #
# #
# Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
# conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop #
# dplyr from breaking base R's lag() function. #
# #
# Code in packages is not affected. It's protected by R's namespace mechanism #
# Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning. #
# #
###############################################################################
Attaching package: 'xts'
The following objects are masked from 'package:dplyr':
first, last
xts_1335 <- as.xts(x = ts_1335)
k2 <- rollmean(xts_1335, k = 2)
k4 <- rollmean(xts_1335, k = 4)
k8 <- rollmean(xts_1335, k = 8)
k16 <- rollmean(xts_1335, k = 16)
k32 <- rollmean(xts_1335, k = 32)
kALL <- merge.xts(xts_1335, k2, k4, k8, k16, k32)
head(kALL)
xts_1335 k2 k4 k8 k16 k32
Nov 1991 0.2297590 0.2126010 NA NA NA NA
Dec 1991 0.1954429 0.2354481 0.2400529 NA NA NA
Jan 1992 0.2754533 0.2675047 0.2258206 NA NA NA
Feb 1992 0.2595562 0.2161932 0.2375532 0.2258988 NA NA
Mar 1992 0.1728301 0.2076017 0.2224200 0.2245342 NA NA
Apr 1992 0.2423733 0.2286468 0.2117447 0.2368738 NA NA
plot.xts(kALL, multi.panel = TRUE)
Let’s use the use the stats::decompose()
function for an additive
model:
decomp_1335 <- decompose(ts_1335, type = "additive") # additive is the default
plot(decomp_1335)
In the top ‘observed’ plot there does not appear to be a clear case of seasonality increasing over time so the additive model should be fine. There is a huge peak in the trend in 1995 which decreases until around 1998 before increasing again.
stl_1335 <- stl(ts_1335, s.window = "periodic") # deompose into seasonal, trend, and irregular components
head(stl_1335$time.series)
seasonal trend remainder
Nov 1991 0.021512597 0.2253694 -0.017122969
Dec 1991 -0.019995577 0.2243884 -0.008949903
Jan 1992 0.035564624 0.2234074 0.016481258
Feb 1992 0.024725533 0.2220857 0.012744950
Mar 1992 -0.007203528 0.2207641 -0.040730440
Apr 1992 0.028745667 0.2191495 -0.005521904
The seasonal and reminder/irregular components are small compared to the trend component.
Let’s seasonally adjust the data and plot the raw data and adjusted data.
sa_1335 <- seasadj(stl_1335) # seasonally adjusted data
par(mfrow = c(2,1))
plot(ts_1335) #, type = "1")
plot(sa_1335)
These two plots are pretty much the same. There does not seem to be a large seasonality component in the data.
It can also be visualised using on the same plot to highlight the small effect of seasonality. Code modified from Time Series Analysis with R Ch. 7.3.
s1335_deseason <- ts_1335 - decomp_1335$seasonal # manually adjust for seasonality
deseason <- ts.intersect(ts_1335, s1335_deseason) # bind the two time series
plot.ts(deseason,
plot.type = "single",
col = c("red", "blue"),
main = "Original (red) and Seasonally Adjusted Series (blue)")
Plot the time series against the seasons in separate years.
par(mfrow = c(1,1))
seasonplot(sa_1335, 12, col=rainbow(12), year.labels=TRUE, main="Seasonal plot: Site 1335")
The lines do not really follow the same pattern throughout the year - again, not a big seasonality component.
The residual part of the model should be random where the model explained most significant patterns or signal in the time series leaving out the noise.
This article states that the following conditions must be met:
- The mean value of time-series is constant over time, which implies, the trend component is nullified.
- The variance does not increase over time.
- Seasonality effect is minimal.
There are a few tests for stationarity with the tseries
package:
Augmented Dickery-Fuller and KPSS. See this
section.
adf.test(ts_1335) # p-value < 0.05 indicates the TS is stationary
Augmented Dickey-Fuller Test
data: ts_1335
Dickey-Fuller = -2.0988, Lag order = 4, p-value = 0.5357
alternative hypothesis: stationary
kpss.test(ts_1335, null = "Trend") # null hypothesis is that the ts is level/trend stationary, so do not want to reject the null, p > 0.05
Warning in kpss.test(ts_1335, null = "Trend"): p-value smaller than printed
p-value
KPSS Test for Trend Stationarity
data: ts_1335
KPSS Trend = 0.25335, Truncation lag parameter = 4, p-value = 0.01
The tests indicate that the time series is not stationary. How do you make a non-stationary time series stationary?
One common way is to difference a time series - subtract each point in the series from the previous point.
Using the forecast
package, we can do seasonal differencing and
regular differencing.
nsdiffs(ts_1335, type = "trend") # seasonal differencing
Warning: The chosen seasonal unit root test encountered an error when testing for the first difference.
From seas.heuristic(): unused argument (type = "trend")
0 seasonal differences will be used. Consider using a different unit root test.
[1] 0
ndiffs(ts_1335, type = "trend") # type 'level' deterministic component is default
[1] 1
stationaryTS <- diff(ts_1335, differences= 1)
diffed <- ts.intersect(ts_1335, stationaryTS) # bind the two time series
plot.ts(diffed,
plot.type = "single",
col = c("red", "blue"),
main = "Original (red) and Differenced Series (blue)")
Let’s check the differenced time series with the same stationarity tests:
adf.test(stationaryTS)
Warning in adf.test(stationaryTS): p-value smaller than printed p-value
Augmented Dickey-Fuller Test
data: stationaryTS
Dickey-Fuller = -8.1769, Lag order = 4, p-value = 0.01
alternative hypothesis: stationary
kpss.test(stationaryTS, null = "Trend")
Warning in kpss.test(stationaryTS, null = "Trend"): p-value greater than
printed p-value
KPSS Test for Trend Stationarity
data: stationaryTS
KPSS Trend = 0.069269, Truncation lag parameter = 4, p-value = 0.1
The both tests now indicate the differenced time series is now stationary.
Plot the autocorrelation function (ACF) correlogram for the time series. There are k lags on the x-axis and the unit of lag is sampling interval (month here). Lag 0 is always the theoretical maximum of 1 and helps to compare other lags.
The cutest explanation of ACF by Dr Allison Horst:
See the full artwork series explaining ACF here.
acf(s1335$mg_per_day)
You can used the ACF to estimate the number of moving average (MA) coefficients in the model. Here, the number of significant lags is high. The lags crossing the dotted blue line are statistically significant.
The partial autocorrelation function can also be plotted. The partial correlation is the left over correlation at lag k between all data points that are k steps apart accounting for the correlation with the data between k steps.
pacf(s1335$mg_per_day)
Practically, this can help us identify the number of autoregression (AR) coefficients in an autoregression integrated moving average (ARIMA) model. The above plot shows k = 3 so the initial ARIMA model will have three AR coefficients (AR(3)). The model will still require fitting and checking.
There is also the base::Box.test()
function that can be used to test
for autocorrelation:
Box.test(ts_1335)
Box-Pierce test
data: ts_1335
X-squared = 57.808, df = 1, p-value = 2.887e-14
The p-value is significant which means the data contains significant autocorrelations.
Most of the content below follows the great book: Introductory Time Series with R by Cowpertwait & Metcalfe. ### Autoregressive model
Autoregressive (AR) models can simulate stochastic trends by regressing the time series on its past values. Order selection is done by Arkaike Information Criterion (AIC) and method chosen here is maximum likelihood estimation (mle).
ar_1335 <- ar(ts_1335, method = "mle")
mean(ts_1335)
[1] 0.3390491
ar_1335$order
[1] 6
ar_1335$ar
[1] 0.692318630 0.056966778 -0.007328706 -0.164283447 0.023423585
[6] 0.284339996
acf(ar_1335$res[-(1:ar_1335$order)], lag = 50)
The correlogram of residuals has a few marginally significant lags (around 15 and between 30-40). The AR(6) model is a relatively good fit for the time series.
Deterministic trends and seasonal variation can be modeled using regression.
Linear models are non-stationary for time series data, thus a non-stationary time series must be differenced.
diff <- window(ts_1335, start = 1991)
Warning in window.default(x, ...): 'start' value not changed
head(diff)
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov
1991 0.2297590
1992 0.2754533 0.2595562 0.1728301 0.2423733
Dec
1991 0.1954429
1992
lm_s1335 <- lm(diff ~ time(diff)) # extract the time component as the explanatory variable
coef(lm_s1335)
(Intercept) time(diff)
-11.040746179 0.005700586
confint(lm_s1335)
2.5 % 97.5 %
(Intercept) -31.137573143 9.05608078
time(diff) -0.004366695 0.01576787
acf(resid(lm_s1335))
The confidence interval does not include 0, which means there is no statistical evidence of increasing Compound X in the atmosphere. The ACF of the model residuals are significantly positively autocorrelated meaning the model likely underestimates the standard error and the confidence interval is too narrow.
Seas <- cycle(diff)
Time <- time(diff)
s1335_slm <- lm(ts_1335 ~ 0 + Time + factor(Seas))
coef(s1335_slm)
Time factor(Seas)1 factor(Seas)2 factor(Seas)3 factor(Seas)4
0.005756378 -11.122691112 -11.131937489 -11.162273796 -11.124295885
factor(Seas)5 factor(Seas)6 factor(Seas)7 factor(Seas)8 factor(Seas)9
-11.155275762 -11.202207940 -11.174639108 -11.139997655 -11.123771124
factor(Seas)10 factor(Seas)11 factor(Seas)12
-11.171495571 -11.139425944 -11.179592661
acf(resid(s1335_slm))
Generalised Least Squares model can account for some of this autocorrelation.
From 5.4 of Cowpertwait & Metcalfe, 2009:
Generalised Least Squares can be used to provide better estimates of the standard errors of the regression parameters to account for the autocorrelation in the residual series.
A correlation structure is defined using the cor
argument. The value
is estimated from the acf at lag 1 in the previous correlogram. The
residuals are approximated as an AR(1).
library(nlme)
Attaching package: 'nlme'
The following object is masked from 'package:forecast':
getResponse
The following object is masked from 'package:dplyr':
collapse
gls_s1335 <- gls(ts_1335 ~ time(ts_1335), cor = corAR1(0.7))
coef(gls_s1335)
(Intercept) time(ts_1335)
-27.9996685 0.0141997
confint(gls_s1335)
2.5 % 97.5 %
(Intercept) -87.94374371 31.94440669
time(ts_1335) -0.01582861 0.04422801
acf(resid(gls_s1335))
The confidence interval still includes 0 and the acf of the model residuals still have significant autocorrelation.
Autoregressive integrated moving average models define the model order (p, d, q).
Cookbook R explains it as:
p is the number of autoregressive coefficients, d is the degree of differencing, and q is tne number of moving average coefficients.
par(mfrow = c(2,1)) # change window so 2 rows, 1 column of plots
plot(ts_1335)
plot(diff(ts_1335))
arima_1 <- arima(ts_1335, order = c(6,1,12))
Warning in arima(ts_1335, order = c(6, 1, 12)): possible convergence problem:
optim gave code = 1
arima_1
Call:
arima(x = ts_1335, order = c(6, 1, 12))
Coefficients:
Warning in sqrt(diag(x$var.coef)): NaNs produced
ar1 ar2 ar3 ar4 ar5 ar6 ma1 ma2
0.3101 -0.1672 -0.2606 -0.5937 0.3171 -0.3346 -0.6401 0.161
s.e. 0.4987 NaN 0.1375 0.2446 0.3642 NaN 0.5091 NaN
ma3 ma4 ma5 ma6 ma7 ma8 ma9 ma10 ma11
0.2277 0.3546 -0.7157 0.5324 -0.1141 0.0339 0.0639 0.0186 -0.2655
s.e. 0.3045 0.1810 0.2855 NaN NaN NaN NaN 0.1282 0.2167
ma12
0.2497
s.e. 0.1244
sigma^2 estimated as 0.005943: log likelihood = 119.27, aic = -200.54
acf(resid(arima_1), lag = 50)
Let’s run a model with order(1, 1, 1) and compare AIC.
arima_2 <- arima(ts_1335, order = c(1, 1, 1))
arima_2
Call:
arima(x = ts_1335, order = c(1, 1, 1))
Coefficients:
ar1 ma1
0.5484 -0.8285
s.e. 0.1300 0.0795
sigma^2 estimated as 0.007826: log likelihood = 106.51, aic = -207.01
acf(resid(arima_2), lag = 50)
The second model had a lower AIC. Let’s use the forecast::auto.arima()
function from the forecast package to search for the best p, d, q.
arima_3 <- auto.arima(ts_1335, seasonal = FALSE, max.p = 20, max.q = 20)
arima_3
Series: ts_1335
ARIMA(1,0,0) with non-zero mean
Coefficients:
ar1 mean
0.7719 0.3452
s.e. 0.0637 0.0362
sigma^2 = 0.007892: log likelihood = 107.77
AIC=-209.54 AICc=-209.31 BIC=-201.52
acf(resid(arima_3), lag = 50)
autoplot(ts_1335) # plot the time series
A seasonal component can also be added to ARIMA. The default for
auto.arima()
includes the seasonal component.
sarima <- auto.arima(ts_1335)
sarima
Series: ts_1335
ARIMA(1,0,0) with non-zero mean
Coefficients:
ar1 mean
0.7719 0.3452
s.e. 0.0637 0.0362
sigma^2 = 0.007892: log likelihood = 107.77
AIC=-209.54 AICc=-209.31 BIC=-201.52
acf(resid(sarima), lag = 50)
The addition of the seasonal component improves the AIC and the correlogram is close to the ‘white noise’ standard.
Check out tidyverts, tidy tools for time series!
Resources used to compile this session included:
- Introductory Time Series with R by Paul Cowpertwait and Andrew Metcalfe, Springer 2009.
- Ch 14 Time Series Analysis in R Cookbook, 2nd edition, by JD Long and Paul Teetor. Copyright 2019 JD Long and Paul Teetor, 978-1-492-04068-2
- Time Series Analysis with R by Nicola Righetti
- Applied Time Series Analysis For Fisheries and Environmental Sciences by Elizabeth Holmes, Mark Scheuerell, and Eric Ward.
- Time Series Analysis article by Selva Prabhakaran
- Working with Financial Time Series Data in R document by Eric Zivot.