Introduction

Preparation and session set up
Getting started

1 Scraping of dynamic web pages

2 Crawl single webpage

2.1 Dynamic web pages
2.2 Scrape information from XHTML

3 Follow links

4 Optional exercises

Citation & Session Info

References

Introduction

This tutorial introduces how to extract and process text data from social media sites, web pages, or other documents for later analysis. The entire R markdown document for the present tutorial can be downloaded here. This tutorial builds heavily on and uses materials from this tutorial on web crawling and scraping using R by Andreas Niekler and Gregor Wiedemann (see Wiedemann and Niekler 2017). The tutorial by Andreas Niekler and Gregor Wiedemann is more thorough, goes into more detail than this tutorial, and overs many more very useful text mining methods. An alternative approach for web crawling and scraping would be to use the RCrawler package (Khalil and Fakir 2017) which is not introduced here thought (inspecting the RCrawler package and its functions is, however, also highly recommended). For a more in-depth introduction to web crawling in scraping, Miner et al. (2012) is a very useful introduction.

NOTE
The code show below does not work at the moment - we are working on making it functional again and we hope that we will have a working version again in due time!

The automated download of HTML pages is called Crawling. The extraction of the textual data and/or metadata (for example, article date, headlines, author names, article text) from the HTML source code (or the DOM document object model of the website) is called Scraping (see Olston and Najork 2010).

Preparation and session set up

This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to and more information how to use R here. For this tutorials, we need to install certain packages from an R library so that the scripts shown below are executed without errors. Before turning to the code below, please install the packages by running the code below this paragraph. If you have already installed the packages mentioned below, then you can skip ahead ignore this section. To install the necessary packages, simply run the following code - it may take some time (between 1 and 5 minutes to install all of the libraries so you do not need to worry if it takes some time).

# install packages
install.packages("rvest")
install.packages("readtext")
install.packages("webdriver")
webdriver::install_phantomjs()
# install klippy for copy-to-clipboard button in code chunks
remotes::install_github("rlesur/klippy")

If not done yet, please install the webdriver package for R and install the phantomJS headless browser. This needs to be done only once.

Now that we have installed the packages, we can activate them as shown below.

# set options
options(stringsAsFactors = F)         # no automatic data transformation
options("scipen" = 100, "digits" = 4) # suppress math annotation
# load packages
library(tidyverse)
library(webdriver)
# activate klippy for copy-to-clipboard button
klippy::klippy()

Once you have installed R, RStudio, and have also initiated the session by executing the code shown above, you are good to go.

Getting started

For web crawling and scraping, we use the package rvest and to extract text data from various formats such as PDF, DOC, DOCX and TXT files with the readtext package. The tasks described int his section consist of:

Download a single web page and extract its content
Extract links from a overview page and extract articles
Extract text data from PDF and other formats on disk

1 Scraping of dynamic web pages

Modern websites often do not contain the full content displayed in the browser in their corresponding source files which are served by the webserver. Instead, the browser loads additional content dynamically via javascript code contained in the original source file. To be able to scrape such content, we rely on a headless browser “phantomJS” which renders a site for a given URL for us, before we start the actual scraping, i.e. the extraction of certain identifiable elements from the rendered site.

Now we can start an instance of PhantomJS and create a new browser session that awaits to load URLs to render the corresponding websites.

require(webdriver)
pjs_instance <- run_phantomjs()
pjs_session <- Session$new(port = pjs_instance$port)

2 Crawl single webpage

In a first exercise, we will download a single web page from The Guardian and extract text together with relevant metadata such as the article date. Let’s define the URL of the article of interest and load the rvest package, which provides very useful functions for web crawling and scraping.

url <- "https://www.theguardian.com/world/2017/jun/26/angela-merkel-and-donald-trump-head-for-clash-at-g20-summit"
require("rvest")

A convenient method to download and parse a webpage provides the function read_html which accepts a URL as a parameter. The function downloads the page and interprets the html source code as an HTML / XML object.

2.1 Dynamic web pages

To make sure that we get the dynamically rendered HTML content of the website, we pass the original source code dowloaded from the URL to our PhantomJS session first, and the use the rendered source.

# load URL to phantomJS session
pjs_session$go(url)
# retrieve the rendered source code of the page
rendered_source <- pjs_session$getSource()
# parse the dynamically rendered source code
html_document <- read_html(rendered_source)

NOTICE: In case the website does not fetch or alter the to-be-scraped content dynamically, you can omit the PhantomJS webdriver and just download the the static HTML source code to retrieve the information from there. In this case, replace the following block of code with a simple call of html_document <- read_html(url) where the read_html() function downloads the unrendered page source code directly.

2.2 Scrape information from XHTML

HTML / XML objects are a structured representation of HTML / XML source code, which allows to extract single elements (headlines e.g. <h1>, paragraphs <p>, links <a>, …), their attributes (e.g. <a href="http://...">) or text wrapped in between elements (e.g. <p>my text...</p>). Elements can be extracted in XML objects with XPATH-expressions.

XPATH (see https://en.wikipedia.org/wiki/XPath) is a query language to select elements in XML-tree structures. We use it to select the headline element from the HTML page. The following xpath expression queries for first-order-headline elements h1, anywhere in the tree // which fulfill a certain condition [...], namely that the class attribute of the h1 element must contain the value content__headline.

The next expression uses R pipe operator %>%, which takes the input from the left side of the expression and passes it on to the function ion the right side as its first argument. The result of this function is either passed onto the next function, again via %>% or it is assigned to the variable, if it is the last operation in the pipe chain. Our pipe takes the html_document object, passes it to the html_node function, which extracts the first node fitting the given xpath expression. The resulting node object is passed to the html_text function which extracts the text wrapped in the h1-element.

title_xpath <- "//h1[contains(@class, 'content__headline')]"
title_text <- html_document %>%
  html_node(xpath = title_xpath) %>%
  html_text(trim = T)

Let’s see, what the title_text contains:

cat(title_text)

## NA

Now we modify the xpath expressions, to extract the article info, the paragraphs of the body text and the article date. Note that there are multiple paragraphs in the article. To extract not only the first, but all paragraphs we utilize the html_nodes function and glue the resulting single text vectors of each paragraph together with the paste0 function.

intro_xpath <- "//div[contains(@class, 'content__standfirst')]//p"
intro_text <- html_document %>%
  html_node(xpath = intro_xpath) %>%
  html_text(trim = T)
cat(intro_text)

## NA

body_xpath <- "//div[contains(@class, 'content__article-body')]//p"
body_text <- html_document %>%
  html_nodes(xpath = body_xpath) %>%
  html_text(trim = T) %>%
  paste0(collapse = "\n")

cat(body_text)

date_xpath <- "//time"
date_object <- html_document %>%
  html_node(xpath = date_xpath) %>%
  html_attr(name = "datetime") %>%
  as.Date()
cat(format(date_object, "%Y-%m-%d"))

## NA

The variables title_text, intro_text, body_text and date_object now contain the raw data for any subsequent text processing.

3 Follow links

Usually, we do not want download a single document, but a series of documents. In our second exercise, we want to download all Guardian articles tagged with “Angela Merkel”. Instead of a tag page, we could also be interested in downloading results of a site-search engine or any other link collection. The task is always two-fold: First, we download and parse the tag overview page to extract all links to articles of interest:

url <- "https://www.theguardian.com/world/angela-merkel"
# go to URL, download and render page
pjs_session$go(url)
rendered_source <- pjs_session$getSource()
# parse the source code into an XML object
html_document <- read_html(rendered_source)

Second, we download and scrape each individual article page. For this, we extract all href-attributes from a-elements fitting a certain CSS-class. To select the right contents via XPATH-selectors, you need to investigate the HTML-structure of your specific page. Modern browsers such as Firefox and Chrome support you in that task by a function called “Inspect Element” (or similar), available through a right-click on the page element.

links <- html_document %>%
  html_nodes(xpath = "//div[contains(@class, 'fc-item__container')]/a") %>%
  html_attr(name = "href")

Now, links contains a list of 20 hyperlinks to single articles tagged with Angela Merkel.

head(links, 3)

## [1] "https://www.theguardian.com/commentisfree/2021/sep/27/the-guardian-view-on-the-german-election-results-negotiating-a-new-era"
## [2] "https://www.theguardian.com/world/2021/sep/27/rival-spd-party-wins-district-held-angela-merkel-1990-anna-kassautzki"         
## [3] "https://www.theguardian.com/commentisfree/2021/sep/27/europe-angela-merkel-chancellor-germany-eu"

But stop! There is not only one page of links to tagged articles. If you have a look on the page in your browser, the tag overview page has several more than 60 sub pages, accessible via a paging navigator at the bottom. By clicking on the second page, we see a different URL-structure, which now contains a link to a specific paging number. We can use that format to create links to all sub pages by combining the base URL with the page numbers.

page_numbers <- 1:3
base_url <- "https://www.theguardian.com/world/angela-merkel?page="
paging_urls <- paste0(base_url, page_numbers)
# View first 3 urls
head(paging_urls, 3)

## [1] "https://www.theguardian.com/world/angela-merkel?page=1"
## [2] "https://www.theguardian.com/world/angela-merkel?page=2"
## [3] "https://www.theguardian.com/world/angela-merkel?page=3"

Now we can iterate over all URLs of tag overview pages, to collect more/all links to articles tagged with Angela Merkel. We iterate with a for-loop over all URLs and append results from each single URL to a vector of all links.

all_links <- NULL
for (url in paging_urls) {
  # download and parse single overview page
  pjs_session$go(url)
  rendered_source <- pjs_session$getSource()
  html_document <- read_html(rendered_source)
  # extract links to articles
  links <- html_document %>%
    html_nodes(xpath = "//div[contains(@class, 'fc-item__container')]/a") %>%
    html_attr(name = "href")
  
  # append links to vector of all links
  all_links <- c(all_links, links)
}

An effective way of programming is to encapsulate repeatedly used code in a specific function. This function then can be called with specific parameters, process something and return a result. We use this here, to encapsulate the downloading and parsing of a Guardian article given a specific URL. The code is the same as in our exercise 1 above, only that we combine the extracted texts and metadata in a data.frame and wrap the entire process in a function-Block.

scrape_guardian_article <- function(url) {
  
  pjs_session$go(url)
  rendered_source <- pjs_session$getSource()
  html_document <- read_html(rendered_source)
  
  title_xpath <- "//h1[contains(@class, 'content__headline')]"
  title_text <- html_document %>%
    html_node(xpath = title_xpath) %>%
    html_text(trim = T)
  
  intro_xpath <- "//div[contains(@class, 'content__standfirst')]//p"
  intro_text <- html_document %>%
    html_node(xpath = intro_xpath) %>%
    html_text(trim = T)
  
  body_xpath <- "//div[contains(@class, 'content__article-body')]//p"
  body_text <- html_document %>%
    html_nodes(xpath = body_xpath) %>%
    html_text(trim = T) %>%
    paste0(collapse = "\n")
  
  date_xpath <- "//time"
  date_text <- html_document %>%
    html_node(xpath = date_xpath) %>%
    html_attr(name = "datetime") %>%
    as.Date()
  
  article <- data.frame(
    url = url,
    date = date_text,
    title = title_text,
    body = paste0(intro_text, "\n", body_text)
  )
  
  return(article)
  
}

Now we can use that function scrape_guardian_article in any other part of our script. For instance, we can loop over each of our collected links. We use a running variable i, taking values from 1 to length(all_links) to access the single links in all_links and write some progress output.

all_articles <- data.frame()
for (i in 1:length(all_links)) {
  cat("Downloading", i, "of", length(all_links), "URL:", all_links[i], "\n")
  article <- scrape_guardian_article(all_links[i])
  # Append current article data.frame to the data.frame of all articles
  all_articles <- rbind(all_articles, article)
}

## Downloading 1 of 60 URL: https://www.theguardian.com/commentisfree/2021/sep/27/the-guardian-view-on-the-german-election-results-negotiating-a-new-era 
## Downloading 2 of 60 URL: https://www.theguardian.com/world/2021/sep/27/rival-spd-party-wins-district-held-angela-merkel-1990-anna-kassautzki 
## Downloading 3 of 60 URL: https://www.theguardian.com/commentisfree/2021/sep/27/europe-angela-merkel-chancellor-germany-eu

# View first articles
head(all_articles, 3)
# Write articles to disk
write.csv2(all_articles, file = "data/guardian_merkel.csv")

The last command write the extracted articles to a CSV-file in the data directory for any later use.

4 Optional exercises

Try to perform extraction of news articles from another web page, e.g. https://www.spiegel.de or https://www.nytimes.com.

For this, investigate the URL patterns of the page and look into the source code with the `inspect element’ functionality of your browser to find appropriate XPATH expressions.

Citation & Session Info

Schweinberger, Martin. 2021. Web Crawling and Scraping using R. Brisbane: The University of Queensland. url: https://slcladal.github.io/webcrawling.html (Version edition = {2021.09.29}).

@manual{schweinberger2021webc,
  author = {Schweinberger, Martin},
  title = {Web Crawling and Scraping using R},
  note = {https://slcladal.github.io/webcrawling.html},
  year = {2021},
  organization = "The University of Queensland, School of Languages and Cultures},
  address = {Brisbane},
  edition = {2021.09.29}
}

sessionInfo()

## R version 4.1.1 (2021-08-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
## [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
## [5] LC_TIME=German_Germany.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] rvest_1.0.1     webdriver_1.0.6 forcats_0.5.1   stringr_1.4.0  
##  [5] dplyr_1.0.7     purrr_0.3.4     readr_2.0.1     tidyr_1.1.3    
##  [9] tibble_3.1.4    ggplot2_3.3.5   tidyverse_1.3.1
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.7        lubridate_1.7.10  png_0.1-7         ps_1.6.0         
##  [5] assertthat_0.2.1  digest_0.6.27     utf8_1.2.2        showimage_1.0.0  
##  [9] R6_2.5.1          cellranger_1.1.0  backports_1.2.1   reprex_2.0.1.9000
## [13] evaluate_0.14     httr_1.4.2        highr_0.9         pillar_1.6.2     
## [17] rlang_0.4.11      curl_4.3.2        readxl_1.3.1      rstudioapi_0.13  
## [21] callr_3.7.0       klippy_0.0.0.9500 rmarkdown_2.5     munsell_0.5.0    
## [25] broom_0.7.9       compiler_4.1.1    modelr_0.1.8      xfun_0.26        
## [29] pkgconfig_2.0.3   base64enc_0.1-3   htmltools_0.5.2   tidyselect_1.1.1 
## [33] fansi_0.5.0       crayon_1.4.1      tzdb_0.1.2        dbplyr_2.1.1     
## [37] withr_2.4.2       grid_4.1.1        jsonlite_1.7.2    gtable_0.3.0     
## [41] lifecycle_1.0.0   DBI_1.1.1         magrittr_2.0.1    scales_1.1.1     
## [45] debugme_1.1.0     cli_3.0.1         stringi_1.7.4     fs_1.5.0         
## [49] xml2_1.3.2        ellipsis_0.3.2    generics_0.1.0    vctrs_0.3.8      
## [53] tools_4.1.1       glue_1.4.2        hms_1.1.0         processx_3.5.2   
## [57] fastmap_1.1.0     yaml_2.2.1        colorspace_2.0-2  knitr_1.34       
## [61] haven_2.4.3

Back to HOME

References

Khalil, Salim, and Mohamed Fakir. 2017. “RCrawler: An R Package for Parallel Web Crawling and Scraping.” SoftwareX 6: 98–106.

Miner, Gary, John Elder IV, Andrew Fast, Thomas Hill, Robert Nisbet, and Dursun Delen. 2012. Practical Text Mining and Statistical Analysis for Non-Structured Text Data Applications. Academic Press.

Olston, Christopher, and Marc Najork. 2010. Web Crawling. Now Publishers Inc.

Wiedemann, Gregor, and Andreas Niekler. 2017. “Hands-on: A Five Day Text Mining Course for Humanists and Social Scientists in R.” In Proceedings of the Workshop on Teaching NLP for Digital Humanities (Teach4DH2017), Berlin, Germany, September 12, 2017., 57–65. http://ceur-ws.org/Vol-1918/wiedemann.pdf.

Web Crawling and Scraping using R

Martin Schweinberger

2021-09-29