In this tutorial I’ll walk through how to:
- Download the Ookla open dataset
- Geocode the tiles to Kentucky counties
- Make a table of the top and bottom 20 counties by download speed
- Map the counties
There are two main ways to join these tiles to another geographic dataset: quadkeys and spatial joins. This tutorial will use the spatial join approach.
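For context on the quadkey option we’re not using: a quadkey identifies a web Mercator tile by a single string, so two datasets on the same tiling scheme can be joined on that string alone, with no geometry operations. As a purely illustrative sketch (not part of this tutorial’s pipeline), here is the standard Bing Maps tile-to-quadkey conversion in base R; tile_x, tile_y, and zoom are hypothetical inputs, not columns from the Ookla data:

# illustrative only: build a quadkey string from web Mercator tile coordinates
tile_to_quadkey <- function(tile_x, tile_y, zoom) {
  digits <- character(zoom)
  for (i in zoom:1) {
    mask <- bitwShiftL(1L, i - 1L)
    digit <- 0L
    if (bitwAnd(tile_x, mask) != 0L) digit <- digit + 1L  # x bit contributes 1
    if (bitwAnd(tile_y, mask) != 0L) digit <- digit + 2L  # y bit contributes 2
    digits[zoom - i + 1] <- as.character(digit)
  }
  paste(digits, collapse = "")
}

tile_to_quadkey(3, 5, 3) # "213"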
library(tigris) # county boundaries
library(tidyverse) # data cleaning
library(sf) # spatial functions
library(knitr)
library(kableExtra) # county statistics table
library(RColorBrewer) # colors
library(here) # file management
library(usethis) # download data
First, download the data to a local directory by uncommenting the use_zip() line below.
# download the zip folder from s3 and save to working directory
# use_zip("https://ookla-open-data.s3-us-west-2.amazonaws.com/shapefiles/performance/type%3Dfixed/year%3D2020/quarter%3D2/2020-04-01_performance_fixed_tiles.zip")
# read the shapefile
tiles <- read_sf(here("2020-04-01_performance_fixed_tiles/gps_fixed_tiles.shp")) %>%
mutate(avg_d_kbps = as.numeric(avg_d_kbps),
avg_u_kbps = as.numeric(avg_u_kbps),
avg_lat_ms = as.numeric(avg_lat_ms))
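It’s worth a quick sanity check that the tiles read in correctly before going further; something along these lines (output not shown):

# quick checks on the tiles
nrow(tiles)     # how many tiles were read
st_crs(tiles)   # coordinate reference system of the tiles
glimpse(tiles)  # column names and types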
Then, I’ll load the Kentucky county boundaries from the U.S. Census Bureau via tigris.
ky_counties <- tigris::counties(state = "Kentucky") %>%
select(state_code = STATEFP, geoid = GEOID, name = NAME) %>% # only keep useful variables
st_transform(4326) # transform to the same CRS as the tiles
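Before joining, it can be reassuring to confirm that both layers now share a CRS and that every Kentucky county came through (Kentucky has 120 counties):

# confirm the layers line up
st_crs(ky_counties) == st_crs(tiles)  # should be TRUE
nrow(ky_counties)                     # should be 120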
Now I’ll join the tiles to the counties. I use left = FALSE because I only want to include counties that have at least one tile.
tiles_in_ky_counties <- st_join(ky_counties, tiles, left = FALSE)
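By default st_join() matches on st_intersects(), so a tile that straddles a county line is attached to every county it touches. If you’d rather count each tile only where it falls entirely inside a county, you could swap in a stricter predicate; a sketch (not used in the rest of this tutorial):

# stricter alternative: keep only tiles wholly contained within a county
# (this drops tiles that cross county boundaries)
tiles_within_ky_counties <- st_join(ky_counties, tiles, join = st_contains, left = FALSE)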
Once the datasets are joined, we are interested in summary statistics at the county level. Since we know the average download speed in each tile, we could take a simple average across tiles, but that would let a tile with a handful of tests count as much as a tile with thousands. Instead, I’ll use a weighted mean, weighted by test count, which recovers the overall county mean as if the data hadn’t been aggregated into tiles first. I’ve included weighted means for upload speed and latency as well.
county_stats <- tiles_in_ky_counties %>%
st_set_geometry(NULL) %>%
group_by(state_code, geoid, name) %>%
summarise(mean_dl_mbps_wt = weighted.mean(avg_d_kbps, tests) / 1000,
mean_ul_mbps_wt = weighted.mean(avg_u_kbps, tests) / 1000,
mean_lat_ms_wt = weighted.mean(avg_lat_ms, tests),
tests = sum(tests)) %>%
ungroup() %>%
left_join(fips_codes %>%
mutate(geoid = paste0(state_code, county_code)) %>%
# get nicer county and state names
select(state, geoid, long_name = county, county), by = c("geoid"))
Next we can make a summary table of just the best and worst counties. We’ll require that counties have at least 50 tests so that the averages are more reliable. I use kable() here for simplicity, but you could use any of the R packages that help with tables.
table_data <- county_stats %>%
filter(tests >= 50) %>%
mutate(rank = min_rank(-mean_dl_mbps_wt)) %>% # rank in descending order
dplyr::filter(rank <= 20 | rank >= n() - 19) %>%
mutate(`County` = paste0(long_name, ", ", state),
mean_dl_mbps_wt = round(mean_dl_mbps_wt, 2)) %>%
arrange(rank) %>%
select(`County`, `Average download speed (Mbps)` = mean_dl_mbps_wt, `Tests` = tests, `Rank` = rank)
kable(table_data, format.args = list(big.mark = ","))
| County | Average download speed (Mbps) | Tests | Rank |
|---|---|---|---|
| Jefferson County, KY | 159.02 | 72,699 | 1 |
| Fayette County, KY | 154.13 | 50,730 | 2 |
| Scott County, KY | 151.21 | 6,632 | 3 |
| Martin County, KY | 150.30 | 1,489 | 4 |
| Christian County, KY | 146.73 | 5,365 | 5 |
| Oldham County, KY | 144.92 | 9,787 | 6 |
| Gallatin County, KY | 144.21 | 673 | 7 |
| Madison County, KY | 144.17 | 6,097 | 8 |
| Clark County, KY | 144.07 | 2,525 | 9 |
| Bell County, KY | 143.69 | 1,165 | 10 |
| Pendleton County, KY | 141.33 | 633 | 11 |
| Warren County, KY | 141.17 | 11,759 | 12 |
| Grant County, KY | 140.04 | 1,355 | 13 |
| Woodford County, KY | 138.50 | 5,933 | 14 |
| Kenton County, KY | 137.51 | 13,883 | 15 |
| McCreary County, KY | 137.23 | 634 | 16 |
| Boone County, KY | 135.33 | 15,704 | 17 |
| Campbell County, KY | 134.47 | 7,576 | 18 |
| Bullitt County, KY | 132.58 | 8,526 | 19 |
| Spencer County, KY | 132.34 | 1,872 | 20 |
| Rockcastle County, KY | 39.15 | 1,083 | 97 |
| Monroe County, KY | 37.63 | 681 | 98 |
| Butler County, KY | 37.07 | 407 | 99 |
| Green County, KY | 36.11 | 915 | 100 |
| Perry County, KY | 35.74 | 4,348 | 101 |
| Livingston County, KY | 34.60 | 728 | 102 |
| Garrard County, KY | 34.03 | 2,252 | 103 |
| Wayne County, KY | 32.78 | 2,659 | 104 |
| Breckinridge County, KY | 31.20 | 1,500 | 105 |
| Harlan County, KY | 30.28 | 2,024 | 106 |
| Leslie County, KY | 29.07 | 1,303 | 107 |
| Lyon County, KY | 28.46 | 1,042 | 108 |
| Knott County, KY | 27.19 | 2,426 | 109 |
| Letcher County, KY | 26.68 | 3,061 | 110 |
| Casey County, KY | 26.28 | 727 | 111 |
| Bracken County, KY | 24.30 | 535 | 112 |
| Breathitt County, KY | 23.16 | 1,197 | 113 |
| Lee County, KY | 22.90 | 130 | 114 |
| Hickman County, KY | 20.42 | 192 | 115 |
| Hancock County, KY | 12.35 | 359 | 116 |
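Since kableExtra is already loaded, you could also dress the table up a bit; one possibility for HTML output (not shown here) is:

# optional styling with kableExtra
kable(table_data, format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = FALSE)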
The table is good for a quick glance at overall patterns (what are the overall maxima and minima? where is the fastest speed?), but unless you’re already familiar with these areas it can be hard to picture where they are. To go along with the table, we can make a quick choropleth map for a more visual representation.
We can join our county statistics table to the basemap (remember, we already dropped the geometry from the county statistics table). I’m also creating a categorical variable from the continuous download speed, because people aren’t great at reading continuous color schemes. Discrete legends are much easier to read, ideally with no more than about seven categories (though this depends on the situation).
One thing that helps people orient themselves on a map is including major place names. The {tigris} package makes it fairly easy to get a quick list!
set.seed(1) # get the same random sample each time
ky_places <- places(state = "Kentucky") %>%
filter(PCICBSA == "Y") %>% # principal cities only
st_centroid() %>%
mutate(NAME = if_else(NAME == "Louisville/Jefferson County metro government (balance)", "Louisville", NAME)) %>% # shorten the name for Louisville
sample_n(15) # keep a random sample of 15 places
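Sampling places at random keeps the map uncluttered, but it can skip cities you’d expect to see labeled. One alternative, assuming you’d rather keep the largest principal cities by land area (ALAND is the land-area column in the TIGER place files), is sketched below:

# alternative to random sampling: keep the 10 largest principal cities by land area
ky_places_largest <- places(state = "Kentucky") %>%
  filter(PCICBSA == "Y") %>%   # principal cities only
  st_centroid() %>%
  arrange(desc(ALAND)) %>%     # largest land area first
  slice(1:10)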
county_stats_sf <- ky_counties %>%
select(geoid) %>%
left_join(county_stats %>% mutate(geoid = as.character(geoid)), by = c("geoid")) %>%
mutate(mean_dl_mbps_wt = case_when(tests < 50 ~ NA_real_,
TRUE ~ mean_dl_mbps_wt)) %>% # at least 50 tests
mutate(dl_cat = cut(mean_dl_mbps_wt, c(0, 25, 50, 100, 150, 200), ordered_result = TRUE))
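Before mapping, a quick tabulation shows how the counties spread across the bins (NA marks counties set aside for having fewer than 50 tests):

# count counties per download-speed bin, including the NA (insufficient data) group
table(county_stats_sf$dl_cat, useNA = "ifany")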
ggplot() +
geom_sf(data = county_stats_sf, aes(fill = dl_cat), color = "gray20", lwd = 0.1) +
geom_sf_text(data = ky_places, aes(label = NAME), color = "black", size = 3) +
theme_void() +
scale_fill_manual(values = brewer.pal(n = 6, name = "BuPu"),
na.value = "gray80",
labels = c("0 to 25", "25.1 to 50", "50.1 to 100", "100.1 to 150", "150.1 to 200", "Insufficient data"),
name = "Mean download speed (Mbps)",
guide = guide_legend(direction = "horizontal", title.position = "top", nrow = 1, label.position = "bottom", keyheight = 0.5, keywidth = 5)) +
theme(text = element_text(color = "gray25"),
legend.position = "top")
sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] usethis_1.6.1 here_0.1 RColorBrewer_1.1-2 kableExtra_1.1.0
## [5] knitr_1.29 sf_0.8-0 forcats_0.5.0 stringr_1.4.0
## [9] dplyr_1.0.2 purrr_0.3.4 readr_1.3.1 tidyr_1.1.0
## [13] tibble_3.0.1 ggplot2_3.3.2 tidyverse_1.3.0 tigris_1.0
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.2 jsonlite_1.7.0 viridisLite_0.3.0 modelr_0.1.8
## [5] assertthat_0.2.1 highr_0.8 sp_1.3-2 blob_1.2.1
## [9] cellranger_1.1.0 yaml_2.2.1 pillar_1.4.4 backports_1.1.8
## [13] lattice_0.20-38 glue_1.4.1 uuid_0.1-2 digest_0.6.25
## [17] rvest_0.3.5 colorspace_1.4-1 htmltools_0.5.0 pkgconfig_2.0.3
## [21] broom_0.5.6 haven_2.3.1 scales_1.1.1 webshot_0.5.1
## [25] farver_2.0.3 generics_0.0.2 ellipsis_0.3.1 withr_2.2.0
## [29] cli_2.0.2 magrittr_1.5 crayon_1.3.4 readxl_1.3.1
## [33] maptools_0.9-8 evaluate_0.14 fs_1.4.2 fansi_0.4.1
## [37] nlme_3.1-140 xml2_1.3.2 foreign_0.8-71 class_7.3-15
## [41] tools_3.6.1 hms_0.5.3 lifecycle_0.2.0 munsell_0.5.0
## [45] reprex_0.3.0 compiler_3.6.1 e1071_1.7-3 rlang_0.4.7
## [49] classInt_0.4-2 units_0.6-5 grid_3.6.1 rstudioapi_0.11
## [53] rappdirs_0.3.1 rmarkdown_2.3 gtable_0.3.0 DBI_1.1.0
## [57] curl_4.3 R6_2.4.1 lubridate_1.7.9 rgdal_1.4-6
## [61] rprojroot_1.3-2 KernSmooth_2.23-15 stringi_1.4.6 Rcpp_1.0.3
## [65] vctrs_0.3.4 dbplyr_1.4.4 tidyselect_1.1.0 xfun_0.15