GTFS data segmentation using R #219
Replies: 3 comments 4 replies
-
There is a way but unfortunately there is no easy step-by-step workflow to do this. The main issue is identifying overlaps. If your feed has precise shapes.txt coordinates you can use sf::st_segmentize() to create segments. You then compare all segments. I'd guess you'd like to summarise shapes that are "close-by but not equal", then you'd have to use st_distance(..., which="Hausdorff") or something like that. So it's mainly a not-so-trivial geospatial issue. As you're new to R I'd suggest looking at Geocomputation with R, I don't think I can help you with that further. Summarising the number of trips (not routes, as a route might have more than one trip with a corresponding shape) on a shape can be done like this. library(tidytransit)
library(dplyr)
g = gtfs_duke |>
gtfs_as_sf()
trip_counts = g$trips |>
distinct(trip_id, shape_id) |>
count(shape_id, name = "n_trips")
shapes = left_join(g$shapes, trip_counts, "shape_id")
library(ggplot2)
ggplot(shapes) +
geom_sf(aes(linewidth = n_trips)) Created on 2024-12-17 with reprex v2.1.1 Note that in this plot there are several overlapping shapes so the linewidth doesn't really show the right number of trips. So this is just a starting point and doesn't solve the issue. To fix overlaps you'd have to solve the geospatial issue. |
Beta Was this translation helpful? Give feedback.
-
In the spirit of advent of code I couldn't help but tinker with this a bit more 🎄🎅 🎁 The following code matches nearby shape points and finds overlapping shapes along segments. It looks ok but I didn't inspect the plausibility of the example dataset further and there might be pitfalls with other dataset. So it's not a catch-all workflow but it should provide a starting point. Snap nearby shape coordinates to allow matchinglibrary(dplyr)
library(sf)
library(tidytransit)
# Add interpolated points along shape lines
add_shape_points = function(gtfs_shapes, crs, max_segment_dist) {
gtfs_shapes |>
shapes_as_sf(crs) |>
st_segmentize(max_segment_dist) |>
tidytransit:::sf_lines_to_df() |> # direct access fn in tidytransit::sf_as_tbl()
as_tibble()
}
# Build hierarchical clusters with all gtfs-shape points and combine them
# within max_dist. The returned `shapes_snapped` has the same number of shape points
# but their lonlat coordinates have been change to their cluster's center.
snap_simplify_shapes = function(gtfs_shapes, crs, max_cluster_dist) {
stopifnot(inherits(gtfs_shapes, "data.frame"))
stopifnot(!inherits(gtfs_shapes, "sf"))
# We don't want to convert shapes to actual lines (with gtfs_sf_sf() or shapes_as_sf()),
# the point coordinates in shapes.txt are actually already fine for our use case.
# However, we need to convert them from lon/lat to a projected coordinate system,
# for the duke example feed it's EPSG:32119
shapes = gtfs_shapes |>
st_as_sf(coords = c("shape_pt_lon", "shape_pt_lat"), crs = 4326, remove = FALSE) |>
st_transform(crs)
# add point coordinates as columns, round coordinates
shapes <- shapes |>
bind_cols(round(st_coordinates(shapes), 1)) |>
st_drop_geometry() |>
mutate(shape_point_id = paste0(X, "-", Y))
# use only unique point coordinates, reduce clustering runtime
coords = shapes |>
distinct(shape_point_id, X, Y)
# hierarchical clustering
htree = cluster::agnes(coords[,c("X", "Y")])
cluster_coords = function(htree, coords, k) {
coords$cluster <- cutree(htree, k = k)
# Use mean coordinates as cluster center
# Maybe this data is already available somewhere with the cluster package
coords_clustered = coords |>
group_by(cluster) |>
mutate(cluster_X = mean(X), cluster_Y = mean(Y)) |>
ungroup()
# calculate euclidian distance to the cluster center
cluster_stats = coords_clustered |>
mutate(dist_to_mean = sqrt((X-cluster_X)^2+(Y-cluster_Y)^2)) |>
summarise(min_dist = min(dist_to_mean), max_dist = max(dist_to_mean)) |>
ungroup()
attributes(coords_clustered)$cluster_stats <- cluster_stats
attributes(coords_clustered)$max_dist <- max(cluster_stats$max_dist)
coords_clustered
}
# find k value where distance to cluster center is less than max_dist (for all clusters)
# first increase k in bigger intervals, afterwards finetune k back in 1-step-decrements
for(k in seq(1, nrow(coords), 100)) {
coords_clustered = cluster_coords(htree, coords, k)
if(attributes(coords_clustered)$max_dist < max_cluster_dist) break
}
for(k in seq(k, 1, -1)) {
coords_clustered = cluster_coords(htree, coords, k)
if(attributes(coords_clustered)$max_dist > max_cluster_dist) break
}
# join shapes with new clustered coordinates and transform back to WGS 84
shapes_snapped = shapes |>
inner_join(coords_clustered, c("shape_point_id", "X", "Y")) |>
select(-shape_pt_lat, -shape_pt_lon, -X, -Y) |>
st_as_sf(coords = c("cluster_X", "cluster_Y"), crs = crs) |>
st_transform(4326)
# use new cluster lon/lat coordinates and trim columns to original columns
shapes_snapped <- shapes_snapped |>
bind_cols(round(st_coordinates(shapes_snapped), 6)) |> # up to 11 cm accuracy
st_drop_geometry() |>
rename(shape_pt_lon = X, shape_pt_lat = Y)
shapes_snapped <- shapes_snapped[,colnames(gtfs_shapes)]
return(shapes_snapped)
} With the functions out of the way, let's simplify the shapes for the example gtfs_duke dataset: # Option 1: snap shapes with existing shape point coordinates
# new_shapes = snap_simplify_shapes(gtfs_duke$shapes, 32119, 15)
# Option 2: add more points along shape line to better match lines
# between original shape points
new_shapes = gtfs_duke$shapes |>
add_shape_points(32119, 20) |>
snap_simplify_shapes(32119, 30) Compare original shapes to snapped shapes: mapview::mapview(shapes_as_sf(gtfs_duke$shapes), color = "blue", legend = F)@map |>
leaflet::setView(-78.934, 36.005, 17) mapview::mapview(shapes_as_sf(new_shapes), color = "red", legend = F)@map |>
leaflet::setView(-78.934, 36.005, 17) Find overlapping segments of shapes# Create "segments" between individual shape points and create a table linking
# shape_ids and newly created segment_ids. Each shape has 1:n segments (probably
# way more than 1) and each segment has 1:m shapes. Segments are one-directional.
# You can build segments with the original shapes table (i.e. without snapping the
# shape coordinates in the previous step if data is already good enough
get_shape_segments = function(gtfs_shapes) {
gtfs_shapes_offset = gtfs_shapes |>
mutate(shape_pt_sequence = shape_pt_sequence-1) |>
filter(shape_pt_sequence > 0) |>
select(shape_id, shape_pt_lon, shape_pt_lat, shape_pt_sequence)
# use linestring wkt as id
# Note that there are overlapping segments with different directions (A->B and B->A)
# which are handled separately in the following workflow. To change this we'd need to
# create the same id for both directions (for example by sorting the coordinates before hashing)
create_segment_wkt = function(x1,y1,x2,y2) {
sprintf("LINESTRING (%.6f %.6f, %.6f %.6f)", x1, y1, x2, y2)
}
gtfs_shape_segments = inner_join(gtfs_shapes, gtfs_shapes_offset, c("shape_id", "shape_pt_sequence")) |>
mutate(segment_id = create_segment_wkt(shape_pt_lon.x, shape_pt_lat.x, shape_pt_lon.y, shape_pt_lat.y)) |>
distinct(shape_id, segment_id)
return(gtfs_shape_segments)
}
(shape_segments = get_shape_segments(new_shapes))
#> # A tibble: 8,355 × 2
#> shape_id segment_id
#> <chr> <chr>
#> 1 p_111287 LINESTRING (-78.914605 36.006000, -78.914605 36.006000)
#> 2 p_111287 LINESTRING (-78.914605 36.006000, -78.914368 36.005660)
#> 3 p_111287 LINESTRING (-78.914368 36.005660, -78.914368 36.005660)
#> 4 p_111287 LINESTRING (-78.914368 36.005660, -78.914435 36.005321)
#> 5 p_111287 LINESTRING (-78.914435 36.005321, -78.914435 36.005321)
#> 6 p_111287 LINESTRING (-78.914435 36.005321, -78.914761 36.005071)
#> 7 p_111287 LINESTRING (-78.914761 36.005071, -78.914761 36.005071)
#> 8 p_111287 LINESTRING (-78.914761 36.005071, -78.915102 36.005272)
#> 9 p_111287 LINESTRING (-78.915102 36.005272, -78.915102 36.005272)
#> 10 p_111287 LINESTRING (-78.915102 36.005272, -78.915571 36.005285)
#> # ℹ 8,345 more rows
segments = shape_segments |>
distinct(segment_id) |>
mutate(segment_wkt = segment_id) |>
st_as_sf(wkt = "segment_wkt", crs = 4326) Analysis workflow, find number of trips per segment# trips per day
trips_per_day = gtfs_duke$trips |>
count(service_id, name = "n_trips_on_service") |>
inner_join(gtfs_duke$.$dates_services, "service_id") |>
summarise(n_trips = sum(n_trips_on_service), .by = date)
# Example with how many trips run on a given date
n_trips_per_shape = gtfs_duke$.$dates_services |>
filter(date == "2019-09-05") |>
inner_join(gtfs_duke$trips, "service_id") |>
count(shape_id, name = "n_trips")
n_trips_per_segment = shape_segments |>
inner_join(n_trips_per_shape, "shape_id") |>
summarise(n_trips = sum(n_trips), .by = segment_id) |>
inner_join(segments, "segment_id") |>
st_as_sf() |>
arrange(n_trips)
# still a bit misleading because overlapping directional segments
library(ggplot2)
ggplot(n_trips_per_segment) +
theme_bw() +
geom_sf(aes(linewidth = n_trips, color = n_trips), lineend = "round") Created on 2024-12-19 with reprex v2.1.1 |
Beta Was this translation helpful? Give feedback.
-
@polettif Thanks so much for your detailed consideration here. It's a problem I've had on my to-do list for a while. Almost all shapes contained in GTFS files exist in Open Street Map, so it should be entirely possible to write a function to do this kind of extraction and analysis. The result would then be geographically precise, and not involve any of the inherit inaccuracies that arise from snap-to-grid kind of approaches. The idea would be to input a GTFS feed, with or without a "shapes" file, and get a network representation back with nodes as stops, and full geographic shapes in between, I'd suggest by default in |
Beta Was this translation helpful? Give feedback.
-
Hi, is there a way to cut the GTFS shapes file into segments where they overlap, and visualise the overlapped lines as a single line with frequency of bus routes on them (shown as line thickness). I am new to R so a step by step process would be helpful.
Beta Was this translation helpful? Give feedback.
All reactions