---
title: "Friends"
author: "Christian"
date: "`r Sys.Date()`"
output:
  prettydoc::html_pretty:
    theme: hpstr
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Synthetic Friends Script
This document details the work conducted in R for this project. The primary aims are to identify the cohort of scenes used for the text model and to produce the visualisations for the article.
## Data
First we'll set up the libraries we'll need and import the data. The data come from the friends_info.csv file in a [2020 edition](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-09-08/friends_info.csv) of Tidy Tuesday.
```{r data, message=FALSE, warning=FALSE}
library(tidyverse)
library(rjson)
library(lubridate)
data <- read_csv('Data/friends_info.csv')
head(data)
```
## Season Quality
The next step is to explore episode and season quality using IMDB ratings. Given how close the season averages turn out to be, a better approach may be to look at the individual episodes instead.
```{r season-quality-viz, message=FALSE, warning=FALSE, fig.align='center'}
qual_season <- data %>%
  group_by(season) %>%
  summarise(avg_rating = mean(imdb_rating)) %>%
  mutate(season = as.factor(paste("Season", season, sep = " "))) %>%
  arrange(desc(avg_rating))

season_plot <- ggplot(qual_season, aes(y = fct_reorder(season, avg_rating), x = avg_rating, fill = season)) +
  geom_col() +
  scale_x_continuous(
    expand = c(0, 0)
    , limits = c(0, 9)
  ) +
  labs(
    x = "Average Rating"
    , y = NULL
    , title = "Friends Quality by Season"
    , subtitle = "Average IMDB rating, by Friends season"
    , caption = "Source: Emil Hvitfeldt"
  ) +
  theme_classic() +
  theme(
    axis.line.x.bottom = element_blank()
    , axis.ticks.x.bottom = element_blank()
    , axis.ticks.y.left = element_blank()
    , panel.grid.major.x = element_line()
    , legend.position = "none"
  )

# Pass the plot explicitly: ggsave() otherwise saves the last plot *displayed*,
# and season_plot has not been printed at this point in the chunk.
ggsave("Images/season_plot.png", plot = season_plot, height = 4.47, width = 7.2)
season_plot
```
## Episode Quality
Given the lack of any significant difference between the season ratings, we'll focus on the individual episodes instead. After identifying the top 10 rated episodes, we'll use those for our analysis. Because `top_n()` keeps ties in the IMDB rating, we'll actually be working with 12 episodes rather than the requested 10.
```{r episode-quality-viz}
episode_quality <- data %>%
  top_n(10, imdb_rating) %>%
  mutate(air_date = year(air_date)) %>%
  select(-directed_by, -written_by, -us_views_millions) %>%
  arrange(desc(imdb_rating))

knitr::kable(
  episode_quality
  , align = 'lllrr'
  , col.names = c(
    "Season"
    , "Episode No."
    , "Episode Title"
    , "Year Aired"
    , "IMDB Rating")
)
```
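The tie behaviour above comes from `top_n()`, which keeps every row that ties at the cutoff rank rather than returning exactly *n* rows. A minimal sketch with toy data (invented titles and ratings, not the Friends data) illustrates this:

```r
library(dplyr)

# Toy table: two shows tie for second place at 9.5.
toy <- tibble(
  title  = c("A", "B", "C", "D"),
  rating = c(9.7, 9.5, 9.5, 9.0)
)

# Asking for the top 2 returns three rows, because the tied
# rows at the cutoff rank are all kept.
toy %>% top_n(2, rating)
# Returns rows A, B, and C.
```

This is exactly why requesting the top 10 episodes yields 12: several episodes share the cutoff rating.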
## Extract episodes
Now that we have the episodes, we'll extract them so we can use them in our Jupyter Notebook. We'll do this by joining the 'friends_info.csv' and 'friends.csv' files together to create a 'corpus_data.csv' file that we'll load into our Python environment.
```{r episode-extraction, message=FALSE, warning=FALSE}
corpus_data <- episode_quality %>%
  left_join(
    read_csv('Data/friends.csv')
    , by = c('season', 'episode')
  )
write.csv(corpus_data, 'Data/corpus_data.csv')
```
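To see what this join does, here's a toy sketch (invented rows, not the real files): each episode row in the ratings table matches many utterance rows in the lines table, so the join expands one row per episode into one row per line of dialogue, and episodes outside the cohort are dropped.

```r
library(dplyr)

# One episode of metadata...
episodes <- tibble(season = 1, episode = 1, imdb_rating = 9.5)

# ...and a table of dialogue lines keyed by season and episode.
lines <- tibble(
  season  = c(1, 1, 1),
  episode = c(1, 1, 2),
  text    = c("How you doin'?", "We were on a break!", "Pivot!")
)

# Joining on both keys repeats the episode metadata for each matching
# line; the season 1, episode 2 line has no row in `episodes`, so it
# does not appear in the result.
episodes %>% left_join(lines, by = c("season", "episode"))
# Two rows: the episode-1 metadata alongside each of its two lines.
```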