# install.packages("rvest")
library(rvest)
# install.packages("tidyverse")
library(tidyverse)
# Read in the list of the website addresses
<- read_csv("inputs/addresses.csv") # Just a list of the URLs
data_to_scrape # and dates for each minutes.
<- data_to_scrape$address
address_to_visit <- data_to_scrape$save_name
save_name
# Create the function that will visit address_to_visit and save to save_name files
<-
visit_address_and_save_content function(name_of_address_to_visit,
name_of_file_to_save_as) {# The function takes two inputs
read_html(name_of_address_to_visit) %>% # Go to the website and read the html
html_node("#content") %>% # Find the content part
html_text() %>% # Extract the text of the content part
write_lines(name_of_file_to_save_as) # Save as a text file
print(paste("Done with", name_of_address_to_visit, "at", Sys.time()))
# Helpful so that you know progress when running it on all the records
Sys.sleep(sample(30:60, 1)) # Space out each request by somewhere between
# 30 and 60 seconds each so that we don't overwhelm their server
}
# If there is an error then ignore it and move to the next one
<-
visit_address_and_save_content safely(visit_address_and_save_content)
# Walk through the addresses and apply the function to each
walk2(address_to_visit,
save_name,~ visit_address_and_save_content(.x, .y))
Gathering and analysing text data
Text modelling is an exciting area of research. But many guides assume that you already have a nice dataset. Similarly, web scraping is an exciting way to get information, but not many explanations go on to explain what you could do with it. This post attempts to go from scraping text from a website through to modelling the topics. It’s not meant to be an exhaustive post, but should hopefully provide enough that you can get started with your own project and know where to go for more information.
Introduction
Text modelling is an exciting area of research. But many guides assume that you already have a nice dataset. Similarly, web scraping is an exciting way to get information, but not many explanations go on to explain what you could do with it. This post attempts to go from scraping text from a website through to modelling the topics. It’s not meant to be an exhaustive post, but should hopefully provide enough that you can get started with your own project and know where to go for more information.
The example that I’m going to use is getting data from the minutes of the RBA board meeting.
Gathering data
The first step is to get some data. I’m going to use the rvest package to do the web scraping. When you are scraping data you should try to be polite - slow down your requests as much as possible, avoid times you know they’ll have a lot of traffic, and check if the website has an API or a robots.txt file (usually access that at domain.com/robots.txt) that provides guidance.
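As an aside on that politeness point, if you want to check a site’s robots.txt from R rather than in a browser, the robotstxt package can do it. This is an optional extra dependency that isn’t part of the workflow above, just a small sketch:

# install.packages("robotstxt")
library(robotstxt)
# Returns TRUE if the site's robots.txt allows crawlers to fetch this path
paths_allowed("https://www.rba.gov.au/monetary-policy/rba-board-minutes/2018/2018-11-06.html")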
The CSV with the addresses and save names that we use looks something like this:
| address | save_name |
|---|---|
| https://www.rba.gov.au/monetary-policy/rba-board-minutes/2018/2018-11-06.html | inputs/minutes/2018-11-06.txt |
| https://www.rba.gov.au/monetary-policy/rba-board-minutes/2018/2018-10-02.html | inputs/minutes/2018-10-02.txt |
| https://www.rba.gov.au/monetary-policy/rba-board-minutes/2018/2018-09-04.html | inputs/minutes/2018-09-04.txt |
| https://www.rba.gov.au/monetary-policy/rba-board-minutes/2018/2018-08-07.html | inputs/minutes/2018-08-07.txt |
| https://www.rba.gov.au/monetary-policy/rba-board-minutes/2018/2018-07-03.html | inputs/minutes/2018-07-03.txt |
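If you need to build that CSV yourself, one option is to generate it from a vector of meeting dates. This is just a sketch that assumes the URL pattern shown in the table holds for the dates you care about, and the dates here are only illustrative:

library(tidyverse)

meeting_dates <- c("2018-11-06", "2018-10-02", "2018-09-04") # Replace with the dates you want
addresses <-
  tibble(
    address = paste0("https://www.rba.gov.au/monetary-policy/rba-board-minutes/",
                     str_sub(meeting_dates, 1, 4), "/", meeting_dates, ".html"),
    save_name = paste0("inputs/minutes/", meeting_dates, ".txt")
  )
write_csv(addresses, "inputs/addresses.csv")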
Analysing data
In this example we’ll use a whole bunch of packages so that you can see what’s available. In general, stringr, quanteda and stm are probably the workhorse packages, with the others used as needed.
#### Workspace set-up ####
# install.packages("broom")
library(broom) # Used to clean up results
# install.packages("devtools")
library(devtools)
# devtools::install_github("DavisVaughan/furrr")
library(furrr) # Used to do parallel processing with the topic models
plan(multisession) # multiprocess is deprecated in recent versions of the future package
# install.packages("quanteda")
library(quanteda) # Used for data cleaning
# install.packages("readtext")
library(readtext) # Used to read in the txt files that were scraped
# install.packages("stm")
library(stm) # Used for more interesting topic models
# install.packages("tictoc")
library(tictoc) # Used for timing
# install.packages("tidytext")
library(tidytext)
# install.packages("tidyverse")
library(tidyverse) # Used for everything
# install.packages('topicmodels')
library(topicmodels) # Used to make basic topic models
# Read in the text that we scraped earlier
text <- readtext::readtext("inputs/minutes/*.txt") # readtext makes this easy,
# but could also use the usual base approach of listing files that end in txt etc.
In general you’ll often need to do a lot of cleaning before you can do the stats bit and get results. Here, I’ll just show two example steps. I’ve found that cleaning the dataset seems to take about 80 per cent of the time.
#### Clean data ####
# Do some basic cleaning - remove punctuation and change everything to lower case
text$text <- str_to_lower(text$text)
text$text <- str_replace_all(text$text, "[:punct:]", " ")
Now that we have a plausibly clean dataset (of course you’d want to come back and clean it more if you were actually interested in analysing the RBA minutes), we can try a topic model. Topic models are essentially just summaries: instead of treating a document as a collection of words, they treat it as a collection of topics, with some probability associated with each topic.
#### First topic modelling ####
# Convert the corpus to a form that the topic model can work with
rba_minutes <- quanteda::corpus(text) %>% # Minimum viable conversion
  quanteda::dfm(remove_punct = TRUE, remove = stopwords('en')) %>% # Get rid of
  # punctuation (in case you didn't already do that) and stop words - check
  # those stop words assumptions
  quanteda::dfm_trim(min_termfreq = 2, # Remove any word that doesn't occur at
                     # least twice
                     min_docfreq = 2) # Get rid of any word that isn't in at least two documents

# Run the topic model with 10 topics
dtm <- quanteda::convert(rba_minutes, to = "topicmodels") # Getting the dfm
# into a form that topicmodels can deal with
lda_topics <- topicmodels::LDA(dtm, k = 10) # The k is the number of topics -
# this decision has a big impact
# Have a look at the terms
terms(lda_topics, 10) # Top 10 words for each topic. Topics are just
# probability distributions over words so you should look at different numbers of words
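If you prefer working with data frames, the tidytext package (loaded above) can also tidy the model so that you can inspect the top words per topic with the usual dplyr verbs. This is just an alternative view of the same output, sketched with the lda_topics object created above:

lda_beta <- tidy(lda_topics, matrix = "beta") # The LDA tidier comes from tidytext; one row per topic-term pair
lda_beta %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>% # Keep the ten highest-probability terms in each topic
  ungroup() %>%
  arrange(topic, -beta)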
Looking at the words in the topics, it seems as though “per” and “cent” are being treated as separate words. The RBA is proud that it separates “per” and “cent”, and if you’re a grad there that’ll stick with you for a while (see earlier paragraphs), but for our purposes they are one word and we need to combine them.
#### Clean data ####
# Let's deal with the first issue first.
$text <- stringr::str_replace_all(text$text, "per cent", "per_cent")
text$text <- stringr::str_replace_all(text$text, "per cent", "per_cent")
text
# You could run the topic model again if you wanted.
Right, that issue of per cent has been fixed, but what if there are combinations of words like this that don’t show up very high in the topics? To identify these we need to construct n-grams. Earlier, ‘per’ and ‘cent’ together formed a 2-gram. Quanteda and the tidyverse make it easy to identify popular n-grams (if your dataset is large then I’d work with a sample of it because these can get a little unwieldy, and we only really care about the popular ones anyway). Our text is in sentences, paragraphs, etc, and we first need to break it down into tokens (essentially separate words). There’s a wonderful set of tutorials put together by the quanteda team here: https://tutorials.quanteda.io and the code for this section is from: https://tutorials.quanteda.io/basic-operations/tokens/tokens_ngrams/.
#### Adjusting for common co-location ####
toks <- tokens(text$text)
# Generate 2-, 3- and 4-grams
ngrams <- tokens_ngrams(toks, n = 2:4)
# Somewhat annoyingly for our purposes (although understandably given the broader
# picture) quanteda puts tokens into its own class, so we need to convert in
# order to use the usual tidyverse tools that we may be more familiar with.
# As a side note, I often find it worth checking the class in R when there's
# an issue because often that's part of the issue, in this case: class(ngrams).
# The tokens class seems to just be a list, so we can unlist it and then put it
# into a more-friendly tibble.
ngram_counts <- tibble(ngrams = unlist(ngrams)) %>%
  count(ngrams, sort = TRUE)
# We can identify a bunch of obvious replacements. If we start getting a long
# list then we can create a file that holds the replacements.
$text <- stringr::str_replace_all(text$text, "assistant governor", "assistant_governor")
text$text <- stringr::str_replace_all(text$text, "reserve bank board", "reserve_bank_board")
text$text <- stringr::str_replace_all(text$text, "unemployment rate", "unemployment_rate")
text$text <- stringr::str_replace_all(text$text, "national accounts", "national_accounts")
text$text <- stringr::str_replace_all(text$text, "australian dollar", "australian_dollar")
text$text <- stringr::str_replace_all(text$text, "monetary policy", "monetary_policy")
text$text <- stringr::str_replace_all(text$text, "united states", "united_states")
text$text <- stringr::str_replace_all(text$text, "exchange rate", "exchange_rate")
text$text <- stringr::str_replace_all(text$text, "glenn stevens", "glenn_stevens")
text$text <- stringr::str_replace_all(text$text, "reserve bank", "reserve_bank")
text$text <- stringr::str_replace_all(text$text, "cash rate", "cash_rate")
text$text <- stringr::str_replace_all(text$text, "us dollar", "us_dollar")
text$text <- stringr::str_replace_all(text$text, "iron ore", "iron_ore")
text
rm(toks, ngrams, ngram_counts)
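If that list of replacements keeps growing, a tidier option - not something the code above does, just a sketch - is to keep the pairs in a named vector (or a CSV that you read into one) and apply them in a single call, because str_replace_all() accepts a named vector of pattern = replacement pairs:

replacement_lookup <- c("assistant governor" = "assistant_governor",
                        "reserve bank board" = "reserve_bank_board",
                        "unemployment rate" = "unemployment_rate") # ...and so on
text$text <- stringr::str_replace_all(text$text, replacement_lookup)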
Take a look at the topics again. Notice that ‘growth’ is in essentially every topic. So is ‘members’ and a couple of others. It’s not that growth isn’t important (insert standard economist joke here), but the fact that ‘members’ shows up suggests that these may just be due to the way that language is used at the RBA, rather than communicating topics. If you read these minutes, you’ll know that the RBA starts a LOT of sentences with ‘Members noted…’. What does this mean for our purposes? Essentially, if you look at each topic by itself they seem ‘coherent’, but taken as a group it seems as though the topics are too similar. Another way to say that is that the words lack ‘exclusivity’. This is a common tradeoff, and our results suggest that it may be worthwhile for us to reduce some of the coherence in order to increase the exclusivity. At this point, we’ll use a different package for creating topic models - the STM package - because it has a bunch of nice features that you might like to take advantage of in future work.
#### Introducing STM and quanteda ####
rba_minutes <- quanteda::corpus(text) %>% # Minimum viable conversion
  quanteda::dfm(remove_punct = TRUE,
                remove_numbers = TRUE,
                remove = stopwords('en')) %>% # Get rid of punctuation (in
  # case you didn't already do that) and stop words - check those stop words assumptions
  quanteda::dfm_trim(min_termfreq = 2, # Remove any word that doesn't occur at least twice
                     min_docfreq = 0.05, # Get rid of any word that isn't in at
                     # least 5 per cent of documents
                     max_docfreq = 0.90, # Get rid of any word that is in more
                     # than 90 per cent of documents
                     docfreq_type = "prop" # Above we specified proportions - you
                     # could instead specify counts or ranks
                     )
# We can run the topic model using STM
topics_stm <- stm(rba_minutes, K = 10)
# Looking at the results you can see that the results are fairly similar to
# those that we got from the topicmodels package, which is what we want.
labelTopics(topics_stm)
rm(topics_stm)
# If we were interested in the results then we might like to pre-process the text
# a little more, for instance removing the names of months.
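For instance, here is a quick sketch of that month-name removal using base R’s built-in month.name vector (remembering that the text has already been converted to lower case); you would then rebuild the document-feature matrix before re-running the model:

text$text <- stringr::str_remove_all(text$text,
                                     paste(stringr::str_to_lower(month.name), collapse = "|"))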
Other than pre-processing decisions, the other major determinant of the outputs of topic models is the number of topics specified. There are a bunch of diagnostic tests that have been developed to help with this decision and we can use some nice code from Julia Silge (https://juliasilge.com/blog/evaluating-stm/) to try a bunch of different values for the number of topics.
#### Deciding on the number of topics ####
tic("With parallel") # This allows us to time the code
many_models <- tibble(K = seq(5, 20, by = 5)) %>% # Here we're running four
  # topic models: 5 topics, 10 topics, 15 topics and 20 topics
  # (tibble() replaces the now-deprecated data_frame())
  mutate(topic_model = future_map(K, ~stm(rba_minutes,
                                          K = .,
                                          verbose = FALSE)))
toc()
# You can also try setting K to zero within STM and seeing the number of topics
# that it recommends: e.g. choose_topic_num_for_me <- stm(rba_minutes, K = 0, verbose = FALSE)
# We want to compare those models with different numbers of topics using various diagnostics.
heldout <- make.heldout(rba_minutes) # First create a test/training set

k_result <- many_models %>%
  mutate(exclusivity = map(topic_model, exclusivity), # How unique are words to the topics
semantic_coherence = map(topic_model, semanticCoherence, rba_minutes), # How
# much the topics tend to be coherent if we look at them (usually a
# tradeoff with exclusivity)
eval_heldout = map(topic_model, eval.heldout, heldout$missing),
residual = map(topic_model, checkResiduals, rba_minutes),
bound = map_dbl(topic_model, function(x) max(x$convergence$bound)),
lfact = map_dbl(topic_model, function(x) lfactorial(x$settings$dim$K)),
lbound = bound + lfact,
iterations = map_dbl(topic_model, function(x) length(x$convergence$bound)))
Put these diagnostics into a nice summary graph (again, the code is originally Julia’s).
k_result %>%
  transmute(K,
`Lower bound` = lbound,
Residuals = map_dbl(residual, "dispersion"),
`Semantic coherence` = map_dbl(semantic_coherence, mean),
`Held-out likelihood` = map_dbl(eval_heldout, "expected.heldout")) %>%
gather(Metric, Value, -K) %>%
ggplot(aes(K, Value, color = Metric)) +
geom_line(show.legend = FALSE) +
facet_wrap(~Metric, scales = "free_y") +
labs(x = "K (number of topics)",
y = NULL,
title = "Model diagnostics by number of topics") +
theme_minimal()
In general we are looking for the max/min of these curves (for example, maximising the held-out likelihood and minimising the residuals), so our results suggest we may be better off with a few more topics (go to Julia’s post to see another example: https://juliasilge.com/blog/evaluating-stm/).
# Have a look at that exclusivity to coherence tradeoff
k_result %>%
  select(K, exclusivity, semantic_coherence) %>%
  unnest(cols = c(exclusivity, semantic_coherence)) %>% # cols is now required by unnest()
  mutate(K = as.factor(K)) %>%
  ggplot(aes(semantic_coherence, exclusivity)) +
  geom_point() +
  facet_wrap(vars(K)) +
  labs(x = "Semantic coherence",
       y = "Exclusivity",
       title = "Comparing exclusivity and semantic coherence") +
  theme_minimal()
Although you’d probably want more, let’s just choose 10 topics for now. What we’re most interested in is getting the betas and gammas so that we can do our usual analysis.
topic_model <- k_result %>%
  filter(K == 10) %>%
  pull(topic_model) %>%
  .[[1]]

# Grab the betas - these are the probability of each term in each topic
td_beta <- broom::tidy(topic_model,
                       matrix = "beta")

# Grab the gammas - these are the probability of each topic in each document
td_gamma <- tidy(topic_model,
                 matrix = "gamma",
                 document_names = rownames(rba_minutes))
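For a quick first look, you could pull out the most prevalent topic in each set of minutes from the gammas we just created - a small sketch using the usual dplyr verbs:

td_gamma %>%
  group_by(document) %>%
  slice_max(gamma, n = 1) %>% # Keep the highest-probability topic for each document
  ungroup() %>%
  arrange(document)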
From here you could look at how the gammas and betas evolve or change using a statistical model. Sometimes even just looking at them is interesting. Julia Silge has a bunch of code that makes very nice graphs and tables. One of the advantages of the STM package is that it makes it easier to include specific types of additional information. For instance, we know that over our time period there have been two governors: Glenn Stevens and Phil Lowe. We could associate each date with who the governor is and then allow that to affect the prevalence of certain topics.
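Here is a sketch of how that governor covariate might look. It assumes that the document names still carry the file names that readtext assigned (for example "2018-11-06.txt") and uses 18 September 2016, when Philip Lowe took over from Glenn Stevens, as the cut-over date:

# Hypothetical covariate construction - check docnames(rba_minutes) first
meeting_date <- as.Date(stringr::str_remove(quanteda::docnames(rba_minutes), "\\.txt$"))
governor_data <- tibble(governor = if_else(meeting_date < as.Date("2016-09-18"),
                                           "Glenn Stevens", "Philip Lowe"))
# stm accepts a quanteda dfm directly; the prevalence formula lets the covariate
# shift how much of each document is allocated to each topic
topics_by_governor <- stm(rba_minutes, K = 10, prevalence = ~ governor,
                          data = governor_data, verbose = FALSE)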
You can grab the files and folder set up from GitHub if you’d like.