Web scraping Reply All transcripts
Download CSV here: https://github.com/landesbergn/reply-all/blob/master/reply_all_text_data.csv
One of my favorite podcasts is Reply All, a show (roughly) about technology and the internet. Hosts PJ Vogt and Alex Goldman, along with a rotating cast of fantastic reporters and producers, tell some of the most fascinating stories about the way we interact with technology. The show has been in production since 2014, and for a time felt like a great little secret, but their website now indicates that the show is downloaded “around 3.5 million times per month.” If you haven’t listened, I’d highly recommend checking it out. Some of my favorite episodes are (in no particular order):
- The Cathedral
- Boy in Photo
- The Grand Tapestry of Pepe
- Shine on You Crazy Goldman
- Long Distance pt.1 and Long Distance pt.2
- Zardulu
I thought it would be a fun project to take the transcripts from every episode of Reply All and see what we can learn about the show. As is often the case in data science, 80% of the challenge is to gather and clean the data.
Part 1: Get episode links
To gather data, we will make use of a few R packages. rvest will do most of the web scraping, and dplyr, tidyr, and stringr will help to clean and organize our data. So, without further ado, let’s load ’em up!
library(rvest) # for web scraping
library(dplyr) # for data tidying
library(stringr) # for working with strings
library(tidytext) # analyze text data!
library(ggplot2) # make nice plots
library(scales) # make plots even nicer
library(tidyr) # for data organization
library(purrr) # for iteration
To get the transcripts of all the episodes, first we need to know which episodes are out there. By navigating to the Reply All website, we can load a handy list of all episodes as well as some hyperlinks that will take us into the individual web page for each episode. This link will serve as the starting point for our web-scraping.
reply_all_url <- "https://www.gimletmedia.com/reply-all/all#all-episodes-list"
We will use some functions from rvest to pull the hyperlinks to every episode we will eventually want to read the transcript from. This was my first time using rvest, and I found this post from Justin Law and Jordan Rosenblum to be helpful, as well as this RStudio blog post from Hadley Wickham. To get the episode links, we will look for every <a> anchor tag and pull its href attribute, which is where the hyperlink lives.
episode_links <- reply_all_url %>%
read_html() %>% # read HTML from page
html_nodes("a") %>% # look for 'anchor' html tag <a>
html_attr("href") %>% # get href attribute from the tag, which is where the hyperlink is stored
tibble(link = .) # put the data in a tibble with the variable 'link'
tail(episode_links) %>%
knitr::kable(format = "html") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
link |
---|
/reply-all/1-an-app-sends-a-stranger-to-say-i-love-you#episode-player |
/ |
/about |
/careers |
/terms-of-service |
/privacy-policy |
We can tell by looking at the head (first 6 rows) and the tail (last 6 rows) of the list of episode links that there are some links to other pages here that are not of interest, like the privacy policy or the terms of service.
Let’s clean this up a little bit by only including links that end in #episode-player, indicating that an episode lives there and not some other webpage. We can also filter out episodes that are rebroadcasts or presentations of other shows in the Reply All feed. The process of filtering down this list took some trial and error, as the formatting of episode titles in the hyperlinks was inconsistent.
episode_links <- episode_links %>%
filter(
str_detect(link, "#episode-player"), # must link to an episode player
!str_detect(link, "re-broadcast"), # no re-broadcasts
!str_detect(link, "rebroadcast"), # or rebroadcasts
!str_detect(link, "revisited"), # or revisits
!str_detect(link, "-2#episode-player"), # or other rebroadcasts
!str_detect(link, "presents"), # or presentations of other shows
!str_detect(link, "introducing"), # or introductions of other shows
!str_detect(link, "updated"), # or updated versions of old episodes
link != "/reply-all/6-this-proves-everything#episode-player", # sneaky update of an old episode that is not labeled well
link != "/reply-all/104-the-case-of-the-phantom-caller#episode-player" # another one
) %>%
distinct() # and after all of that, no duplicates
episode_links <- episode_links %>%
mutate(episode_number = nrow(episode_links) + 1 - row_number()) # links are listed newest-first, so count back from the total to get the episode number
head(episode_links) %>%
knitr::kable(format = "html") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
link | episode_number |
---|---|
/reply-all/135-the-robocall-conundrum#episode-player | 135 |
/reply-all/134-the-year-of-the-wallop#episode-player | 134 |
/reply-all/133-reply-alls-2018-year-end-extravaganza#episode-player | 133 |
/reply-all/132-negative-mount-pleasant#episode-player | 132 |
/reply-all/131-surefire-investigations#episode-player | 131 |
/reply-all/130-lizard#episode-player | 130 |
We can also add in the episode number, which I originally tried to parse out of the string, but I am not good at regular expressions, so I just used some math to back into it: since the links are listed newest-first, the episode number is the total number of links, plus one, minus the row number.
Finally, we add in the full web link by appending the hyperlink extension to the Gimlet Media homepage (gimletmedia.com).
ep_data <- episode_links %>%
mutate(
full_link = paste0("https://www.gimletmedia.com", link)
)
head(ep_data) %>%
knitr::kable(format = "html") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
link | episode_number | full_link |
---|---|---|
/reply-all/135-the-robocall-conundrum#episode-player | 135 | https://www.gimletmedia.com/reply-all/135-the-robocall-conundrum#episode-player |
/reply-all/134-the-year-of-the-wallop#episode-player | 134 | https://www.gimletmedia.com/reply-all/134-the-year-of-the-wallop#episode-player |
/reply-all/133-reply-alls-2018-year-end-extravaganza#episode-player | 133 | https://www.gimletmedia.com/reply-all/133-reply-alls-2018-year-end-extravaganza#episode-player |
/reply-all/132-negative-mount-pleasant#episode-player | 132 | https://www.gimletmedia.com/reply-all/132-negative-mount-pleasant#episode-player |
/reply-all/131-surefire-investigations#episode-player | 131 | https://www.gimletmedia.com/reply-all/131-surefire-investigations#episode-player |
/reply-all/130-lizard#episode-player | 130 | https://www.gimletmedia.com/reply-all/130-lizard#episode-player |
Part 2: Scrape transcripts from links
Now that we have a link to every episode we want to parse, it is time to write a function that, when given a link, returns a transcript. This took a lot of time to nail down: as alluded to, the data and transcript formatting are inconsistent and messy, so it took a fair amount of iteration to get right. I spent a lot of time testing patterns on https://regex101.com.
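Before the full function, here is a minimal sketch, on a made-up line, of how the speaker-splitting regex it relies on behaves. The toy string and the base-R regmatches() call are just for illustration; the function itself uses gsubfn::strapply for the same job.
# a made-up toy line (not from a real transcript)
# "[^.a-z]+:" matches a run of characters that are neither lowercase letters
# nor periods, followed by a colon, i.e. an all-caps speaker label like "PJ VOGT:"
toy <- "PJ VOGT: Hey everybody. ALEX GOLDMAN: Hi, how are you?"
# splitting on the labels leaves the spoken text
strsplit(toy, "[^.a-z]+:", perl = TRUE)
#> [[1]]
#> [1] ""                  " Hey everybody."   " Hi, how are you?"
# extracting the matches gives the speaker labels themselves
regmatches(toy, gregexpr("[^.a-z]+:", toy, perl = TRUE))
#> [[1]]
#> [1] "PJ VOGT:"       " ALEX GOLDMAN:"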
getTranscript <- function(episode_link) {
# print statement helps to understand the progress of the function when it is running (commented out for now)
# print(episode_link)
# get the transcript from the webpage by referencing the CSS node '.episode-transcript'
transcript <- episode_link %>%
read_html() %>%
html_nodes(".episode-transcript") %>%
html_text()
# this is an attempt at splitting up the text by speaker, so that
# each string is separated by the person who said it.
episode_text <- tibble(
unlist(strsplit(transcript, "[^.a-z]+:", perl = T)) # split on speaker labels like 'PJ VOGT:' or 'ALEX GOLDMAN:' at the beginning of a line
) %>% setNames("text")
episode_text <- episode_text %>%
mutate(
text = trimws(text) # remove leading and trailing white-space
) %>%
filter(
tolower(text) != "transcript\n ", # get rid of line indicating start of transcript
tolower(text) != "transcript", # or this type of line indicating the start of transcript,
tolower(text) != "[theme music", # get rid of Theme (or theme music)
tolower(text) != "[intro music", # get rid of Intro Music (or intro music)
tolower(text) != "transcript\n [intro music", # or a mix of things
tolower(text) != "transcript\n [theme music", # or a different mix of things
tolower(text) != "transcript\n [reply all theme"
)
# ok this is gross, but handling some specific episodes where the transcript is not entered in a consistent way
if (episode_link == "https://www.gimletmedia.com/reply-all/79-boy-in-photo#episode-player") {
speaker <- rbind(
"PJ",
data.frame(gsubfn::strapply(transcript, "[^.a-z]+:", c, perl = TRUE), stringsAsFactors = FALSE) %>% setNames("speaker")
)
} else if (episode_link == "https://www.gimletmedia.com/reply-all/52-raising-the-bar#episode-player") {
speaker <- rbind(
"PJ",
data.frame(gsubfn::strapply(transcript, "[^.a-z]+:", c, perl = TRUE), stringsAsFactors = FALSE) %>% setNames("speaker")
)
episode_text <- episode_text %>% mutate(text = str_replace_all(text, "Transcript\n PJ Vogt: ", ""))
} else if (episode_link == "https://www.gimletmedia.com/reply-all/31-bonus-the-reddit-implosion-explainer#episode-player") {
speaker <- rbind(
"PJ",
data.frame(gsubfn::strapply(transcript, "[^.a-z]+:", c, perl = TRUE), stringsAsFactors = FALSE) %>% setNames("speaker")
)
} else if (episode_link == "https://www.gimletmedia.com/reply-all/2-instagram-for-doctors#episode-player") {
speaker <- rbind(
"PJ",
data.frame(gsubfn::strapply(transcript, "[^.a-z]+:", c, perl = TRUE), stringsAsFactors = FALSE) %>% setNames("speaker")
)
episode_text <- episode_text %>% mutate(text = str_replace_all(text, fixed("Transcript\n [THEME SONG]PJ Vogt: "), ""))
} else {
speaker <- gsubfn::strapply(transcript, "[^.a-z]+:", c, perl = TRUE) %>% setNames("speaker")
}
transcript_clean <- data.frame(episode_text, speaker) %>%
mutate(
speaker = trimws(speaker),
speaker = str_replace_all(speaker, "[^A-Z ]", ""),
speaker = str_replace_all(speaker, "THEME MUSIC", ""),
speaker = str_replace_all(speaker, "RING", ""),
speaker = str_replace_all(speaker, "MUSIC", ""),
speaker = str_replace_all(speaker, "BREAK", "")
)
# do some light cleaning of the transcript
transcript_new <- transcript_clean %>%
mutate(
linenumber = row_number() # get the line number (1 is the first line, 2 is the second, etc.)
) %>%
select(speaker, text, linenumber)
# convert normal boring text into exciting cool tidy text
transcript_tidy <- transcript_new %>%
unnest_tokens(word, text)
return(transcript_tidy)
}
Now that we have a function, let’s test it for a single episode to show it in action.
getTranscript("https://www.gimletmedia.com/reply-all/114-apocalypse-soon#episode-player") %>%
head() %>%
knitr::kable(format = "html") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
 | speaker | linenumber | word |
---|---|---|---|
1 | PJ VOGT | 1 | hey |
1.1 | PJ VOGT | 1 | everybody |
1.2 | PJ VOGT | 1 | we |
1.3 | PJ VOGT | 1 | are |
1.4 | PJ VOGT | 1 | back |
1.5 | PJ VOGT | 1 | it’s |
Part 3: Pull transcripts for each episode
Now we can use purrr to iterate through every episode and ‘map’ the function getTranscript to each episode link. I learned a lot about iterating with purrr from this tutorial from Jenny Bryan and the chapter on iteration in R for Data Science by Garrett Grolemund and Hadley Wickham. This takes ~3 minutes to run, depending on your internet connection.
# use purrr to map the 'getTranscript' function over all of the URLs in the ep_data data frame
ep_data <- ep_data %>%
mutate(
transcript = map(full_link, getTranscript)
)
# unnest the results into one big data frame
tidy_ep_data <- ep_data %>%
unnest(transcript)
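One caveat: if any single page fails to download or parse, the whole map() call errors out and all progress is lost. I did not need this here, but a defensive variant is to wrap getTranscript in purrr::possibly(), which returns NULL for a failed episode instead of aborting the run.
# sketch of a more fault-tolerant run (not part of the workflow above)
safe_getTranscript <- purrr::possibly(getTranscript, otherwise = NULL)
ep_data_safe <- ep_data %>%
  mutate(
    transcript = map(full_link, safe_getTranscript)
  ) %>%
  filter(!map_lgl(transcript, is.null)) # drop any episodes that failed to scrape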
head(tidy_ep_data) %>%
knitr::kable(format = "html") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
link | episode_number | full_link | speaker | linenumber | word |
---|---|---|---|---|---|
/reply-all/135-the-robocall-conundrum#episode-player | 135 | https://www.gimletmedia.com/reply-all/135-the-robocall-conundrum#episode-player | PJ VOGT | 1 | from |
/reply-all/135-the-robocall-conundrum#episode-player | 135 | https://www.gimletmedia.com/reply-all/135-the-robocall-conundrum#episode-player | PJ VOGT | 1 | gimlet |
/reply-all/135-the-robocall-conundrum#episode-player | 135 | https://www.gimletmedia.com/reply-all/135-the-robocall-conundrum#episode-player | PJ VOGT | 1 | this |
/reply-all/135-the-robocall-conundrum#episode-player | 135 | https://www.gimletmedia.com/reply-all/135-the-robocall-conundrum#episode-player | PJ VOGT | 1 | is |
/reply-all/135-the-robocall-conundrum#episode-player | 135 | https://www.gimletmedia.com/reply-all/135-the-robocall-conundrum#episode-player | PJ VOGT | 1 | reply |
/reply-all/135-the-robocall-conundrum#episode-player | 135 | https://www.gimletmedia.com/reply-all/135-the-robocall-conundrum#episode-player | PJ VOGT | 1 | all |
Part 4: Clean the transcripts
Now that we have a big data frame, we can do a little more cleaning of the data. Arguably, this is avoidable with more intelligent regex and string work earlier, but this cleanup will have to do for now. I briefly use the zoo package to fill in some NA values in the speaker column using the previous non-NA value (inspired by this Stack Overflow answer).
# turn missing values to NA and then fill using
# the na.locf (last observation carried forward) function from the 'zoo' package
tidy_ep_data <- tidy_ep_data %>%
mutate(
speaker = if_else(speaker == "", NA_character_, speaker),
speaker = zoo::na.locf(speaker)
)
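As an aside, tidyr (which we already loaded) can do the same fill without pulling in zoo; an equivalent version of the step above would be:
# equivalent fill using tidyr::fill(), which carries the last
# non-NA value downward by default
tidy_ep_data <- tidy_ep_data %>%
  mutate(speaker = if_else(speaker == "", NA_character_, speaker)) %>%
  fill(speaker, .direction = "down")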
# get the list of speakers clean
tidy_ep_data_clean <- tidy_ep_data %>%
filter(
!grepl("CREDIT", speaker), # remove credit chit-chat
!grepl("AD ", speaker), # remove ad chit-chat
!grepl("THEME", speaker), # remove theme chit-chat
speaker != "ADPJ",
speaker != "ADALEX",
speaker != "OUTPJ",
speaker != "OUTALEX"
) %>%
mutate(
speaker = trimws(speaker),
speaker = case_when(
speaker == "ALEX" ~ "ALEX GOLDMAN",
speaker == "REPLY ALL ALEX GOLDMAN" ~ "ALEX GOLDMAN",
speaker == "GOLDMAN" ~ "ALEX GOLDMAN",
speaker == "AG" ~ "ALEX GOLDMAN",
speaker == "PJ" ~ "PJ VOGT",
speaker == "REPLY ALL PJ VOGT" ~ "PJ VOGT",
speaker == "BLUMBERG" ~ "ALEX BLUMBERG",
speaker == "AB" ~ "ALEX BLUMBERG",
speaker == "SRUTHI" ~ "SRUTHI PINNAMANENI",
TRUE ~ speaker
)
)
And after all of that, we now have some sort of nice text data from every episode of Reply All!
glimpse(tidy_ep_data_clean)
## Observations: 704,321
## Variables: 6
## $ link <chr> "/reply-all/135-the-robocall-conundrum#episode-pl…
## $ episode_number <dbl> 135, 135, 135, 135, 135, 135, 135, 135, 135, 135,…
## $ full_link <chr> "https://www.gimletmedia.com/reply-all/135-the-ro…
## $ speaker <chr> "PJ VOGT", "PJ VOGT", "PJ VOGT", "PJ VOGT", "PJ V…
## $ linenumber <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3…
## $ word <chr> "from", "gimlet", "this", "is", "reply", "all", "…
head(tidy_ep_data_clean) %>%
knitr::kable(format = "html") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
link | episode_number | full_link | speaker | linenumber | word |
---|---|---|---|---|---|
/reply-all/135-the-robocall-conundrum#episode-player | 135 | https://www.gimletmedia.com/reply-all/135-the-robocall-conundrum#episode-player | PJ VOGT | 1 | from |
/reply-all/135-the-robocall-conundrum#episode-player | 135 | https://www.gimletmedia.com/reply-all/135-the-robocall-conundrum#episode-player | PJ VOGT | 1 | gimlet |
/reply-all/135-the-robocall-conundrum#episode-player | 135 | https://www.gimletmedia.com/reply-all/135-the-robocall-conundrum#episode-player | PJ VOGT | 1 | this |
/reply-all/135-the-robocall-conundrum#episode-player | 135 | https://www.gimletmedia.com/reply-all/135-the-robocall-conundrum#episode-player | PJ VOGT | 1 | is |
/reply-all/135-the-robocall-conundrum#episode-player | 135 | https://www.gimletmedia.com/reply-all/135-the-robocall-conundrum#episode-player | PJ VOGT | 1 | reply |
/reply-all/135-the-robocall-conundrum#episode-player | 135 | https://www.gimletmedia.com/reply-all/135-the-robocall-conundrum#episode-player | PJ VOGT | 1 | all |
Part 5: Save to a CSV
We can write the data to a .csv for anyone to use in the future.
readr::write_csv(tidy_ep_data_clean, "reply_all_text_data.csv")
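And to read it back into R in a later session:
# read the saved data back in
tidy_ep_data_clean <- readr::read_csv("reply_all_text_data.csv")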
Download CSV here: https://github.com/landesbergn/reply-all/blob/master/reply_all_text_data.csv
In a future post, I will start to analyze this data!