Web Scraping

Sam Mason

Learning goals

Following this lecture, students should be able to:

  1. Read and navigate HTML code to find target elements
  2. Scrape simple, static HTML webpages

Packages

library(rvest) # for web scraping
library(tidyverse) # for everything else

A last resort

  • The ways in which we import (bring in) data

    • From files (local or web)

    • From direct database connection (coming soon)

    • Using web APIs

    • Web scraping (when all else fails)

Note: Import methods not ranked

Files, databases, and web APIs are all (more or less) equally good options — these three methods are not listed by order of importance or validity.

Is web scraping ethical?

  • Respect the wishes of the organization (terms of service)

  • Respect the wishes of the web developer (robots.txt)

Note: The robots.txt file

The robots.txt file provides instructions (written by the web developer) that outline where web scraping is not allowed to take place on a given website. It may also set limits on the frequency with which a scraping program can query the website servers (in the context of iterative scraping). Compliance with robots.txt is voluntary (nothing technically enforces it), but honoring it is considered basic scraping etiquette.

Reading robots.txt
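
A site’s robots.txt always lives at the root of the domain (e.g., https://example.com/robots.txt), so you can inspect it directly from R. A minimal sketch, using Wikipedia’s robots.txt as the example:

read_lines("https://en.wikipedia.org/robots.txt") |>
  head(10) # peek at the first ten directives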

Note: Crawling vs. scraping

robots.txt files are created with web “crawling” programs (called “spiders”) in mind. These programs, used commonly by search engines like Google, Baidu (China), Bing, etc., index the internet by traveling from website to website using hyperlinks, sometimes scraping data along the way. In contrast, a web scraping program is generally designed to scrape specific values from a specific webpage or set of webpages. Despite being generally narrower in scope, it remains best practice for web scraping activities to respect the directives outlined in robots.txt.

  • robots.txt files are composed of only a few standard “directives”

  • User-agent: Identifies (“IDs”) which scraping programs the directives that follow apply to

  • Disallow: Indicates a URL path that is off-limits to scraping

  • Allow: Indicates a URL path that is okay to scrape

Note: Non-standard directives

The three most common directives (listed above) give you information about where you can and cannot scrape. There are also a series of directives that relate to how you scrape, though these are less common. The most relevant “non-standard” directive (for us) is Crawl-delay:, often interpreted as the number of seconds to wait between website server requests.
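
For example, a robots.txt that uses Crawl-delay: might look like this (an illustrative sketch, not taken from a real site):

User-agent: *
Crawl-delay: 10
Disallow: /private/

Here, all scraping programs are asked to wait 10 seconds between server requests and to stay out of the /private/ path.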

Directives            | Interpretation
----------------------|------------------------------------------------------
User-agent: *         | The * means “all scraping programs” and the / means
Disallow: /           | “the whole website”: nothing may be scraped
----------------------|------------------------------------------------------
User-agent: *         | “All scraping programs” are not allowed to scrape
Disallow: /secrets/   | the /secrets/ path of the website
----------------------|------------------------------------------------------
User-agent: BadBot    | The scraping program “BadBot” is allowed to scrape
Allow:                | “nothing” (i.e., it is banned from the entire site)

How does web scraping work in R?

Note: Scraping static HTML

The process below relates to static HTML web scraping. Scraping dynamic webpages (i.e., those that use JavaScript to render data “on the fly”) is not covered in this course.

  1. Read the raw HTML code into R

  2. Identify elements in the HTML code that are linked to target data

  3. Extract target data from element text and attributes

  4. Transform (e.g., rectangle, tidy, etc.) the data as needed
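
Condensed into code, the four steps look something like this sketch (it previews the Star Wars example developed throughout the rest of the lecture):

page <- read_html("https://rvest.tidyverse.org/articles/starwars.html") # 1. read raw HTML
films <- html_elements(page, "section") # 2. identify elements linked to target data
titles <- html_text2(html_element(films, "h2")) # 3. extract target data from element text
tibble(title = titles) # 4. transform the data into a rectangle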

Reading in HTML pages

  • We’ll be using rvest::read_html()

  • Let’s read in the Star Wars Films webpage from the {rvest} package website

read_html("https://rvest.tidyverse.org/articles/starwars.html")
{html_document}
<html lang="en">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
[2] <body>\n    <a href="#container" class="visually-hidden-focusable">S ...
  • All HTML webpages are composed of two major elements

    • The <head>, which contains metadata
    • The <body>, which contains the webpage contents
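
You can select either part directly once the page is read in; a quick sketch:

page <- read_html("https://rvest.tidyverse.org/articles/starwars.html")
html_element(page, "head") # the metadata
html_element(page, "body") # the contents we usually want to scrape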

HTML elements

  • HTML is a markup language (that’s the “ML” part of the name)

  • Plain text is formatted and organized using character strings with special meaning

  • We call these special character strings elements

  • Most elements are made up of one start tag and one end tag (a few “void” elements, like <img>, have only a start tag)

  • The start tag can optionally contain attributes that modify how the element is rendered

<html>
  <head>
    <title>A simple website</title>
  </head>
  <body>
    <h1>Cool websites</h1>
    <a href="https://r4ds.hadley.nz">R for Data Science (2e)</a>
    <a href="https://www.tmwr.org">Tidy Modeling with R</a>
    <a href="https://clauswilke.com/dataviz/">Fundamentals of Data Visualization</a>
  </body>
</html>

Note: HTML is hierarchical

Notice that HTML elements are organized hierarchically like JSON objects and arrays. Identifying the patterns in this hierarchy is the key to web scraping. We tend to talk about nested elements using familial terms: <h1> is a child of <body>, which is a child of <html>.
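
If you want to experiment with this hierarchy without scraping a live site, rvest::minimal_html() builds a small in-memory page; a sketch:

# build a toy page, then walk down its hierarchy
toy <- minimal_html("
  <h1>Cool websites</h1>
  <a href='https://r4ds.hadley.nz'>R for Data Science (2e)</a>
")
toy |> html_element("body") |> html_children() # the <h1> and <a> children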

  • Common HTML elements

    • Structural elements

      • <div>, <section> (organizational)

      • <h1>, <h2>, <h3>, etc. (headers)

      • <p> (paragraph), <ol> (ordered list)

    • Inline elements

      • <b> (bold), <i> (italics), <a> (hyperlink)

    • Media elements

      • <img> (image), <video> (video)

Identifying elements of interest

  • Star Wars Films webpage

  • Begin by conceptualizing the data frame you want

    • Each film will get its own row

    • Columns for title, release date, director, and crawl text

Tip: Using your web browser to explore HTML

Your web browser includes a “developer” mode that will allow you to inspect the HTML code used to build the webpage you’re viewing. A quick Google search will help you figure out how to enable developer mode in your web browser.

  • Each film is organized into its own <section>

Isolating data from elements

Motivating prompt

I’d like to scrape all of the film titles into an R vector.

read_html("https://rvest.tidyverse.org/articles/starwars.html") |>
  html_elements("section")
{xml_nodeset (7)}
[1] <section><h2 data-id="1">\nThe Phantom Menace\n</h2>\n<p>\nReleased: ...
[2] <section><h2 data-id="2">\nAttack of the Clones\n</h2>\n<p>\nRelease ...
[3] <section><h2 data-id="3">\nRevenge of the Sith\n</h2>\n<p>\nReleased ...
[4] <section><h2 data-id="4">\nA New Hope\n</h2>\n<p>\nReleased: 1977-05 ...
[5] <section><h2 data-id="5">\nThe Empire Strikes Back\n</h2>\n<p>\nRele ...
[6] <section><h2 data-id="6">\nReturn of the Jedi\n</h2>\n<p>\nReleased: ...
[7] <section><h2 data-id="7">\nThe Force Awakens\n</h2>\n<p>\nReleased:  ...
  • The html_elements() function creates a special list (called an XML nodeset) of all <section> elements
read_html("https://rvest.tidyverse.org/articles/starwars.html") |>
  html_elements("section") |>
  html_element("h2")
{xml_nodeset (7)}
[1] <h2 data-id="1">\nThe Phantom Menace\n</h2>
[2] <h2 data-id="2">\nAttack of the Clones\n</h2>
[3] <h2 data-id="3">\nRevenge of the Sith\n</h2>
[4] <h2 data-id="4">\nA New Hope\n</h2>
[5] <h2 data-id="5">\nThe Empire Strikes Back\n</h2>
[6] <h2 data-id="6">\nReturn of the Jedi\n</h2>
[7] <h2 data-id="7">\nThe Force Awakens\n</h2>
  • The html_element() function operates on each element of the list returned by html_elements()

  • Unlike html_elements(), html_element() returns exactly one match (the first) for each input element

read_html("https://rvest.tidyverse.org/articles/starwars.html") |>
  html_elements("section") |>
  html_element("h2") |>
  html_text2()
[1] "The Phantom Menace"      "Attack of the Clones"   
[3] "Revenge of the Sith"     "A New Hope"             
[5] "The Empire Strikes Back" "Return of the Jedi"     
[7] "The Force Awakens"      
  • The html_text2() function pulls the plain text out of each <h2> element
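
Note that rvest also has html_text(), which returns the raw text as written in the source. The difference matters when the HTML contains layout whitespace: html_text2() collapses it the way a browser would. A quick sketch using rvest::minimal_html():

h2 <- minimal_html("<h2>\nThe Phantom Menace\n</h2>") |> html_element("h2")
html_text(h2) # "\nThe Phantom Menace\n" (raw source text)
html_text2(h2) # "The Phantom Menace" (as a browser would display it)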

In-class exercises

  1. Scrape the release date data from Star Wars Films into a character vector.
  2. Convert the character vector to a date vector.
# exercise 1
url <- "https://rvest.tidyverse.org/articles/starwars.html"
read_html(url) |>
  html_elements("section") |> # a list of <section> elements
  html_element("p") |> # pulls out the first <p> element
  html_text2() |> # pulls out the text from each <p> element
  str_remove("Released: ") -> # removes "Released: "
  sw_film_date

# exercise 2
sw_film_date <- ymd(sw_film_date)

Finding elements by attribute

Motivating prompt

I’d like to scrape all director data into an R vector.

  • HTML webpages are often accompanied by CSS files

  • The HTML provides the content and structure of the webpage

  • The CSS provides the ✨aesthetics

  • HTML elements are styled based on their id= and class= attributes

  • We can use these attributes to be more precise in our element selection

read_html("https://rvest.tidyverse.org/articles/starwars.html") |>
  html_elements("section") |>
  html_element(".director") |> # all elements with class=director
  html_text2()
[1] "George Lucas"     "George Lucas"     "George Lucas"    
[4] "George Lucas"     "Irvin Kershner"   "Richard Marquand"
[7] "J. J. Abrams"    
  • Here the "." in ".director" indicates class=

  • If we wanted to find all elements where id=director (no such elements actually exist in this HTML), we’d write html_element("#director")

read_html("https://rvest.tidyverse.org/articles/starwars.html") |>
  html_elements("section") |>
  html_element("#director") |> # pulls all elements with id=director
  html_text2() # there are no HTML elements with id=director
[1] NA NA NA NA NA NA NA

In-class exercise

  1. Scrape the text crawl (the iconic wall of text that scrolls at the start of each Star Wars film) data from Star Wars Films into a character vector
url <- "https://rvest.tidyverse.org/articles/starwars.html"
read_html(url) |>
  html_elements("section") |>
  html_element(".crawl") |>
  html_text2() ->
  text_crawl

Combining vectors into tibble

  • Each one of our scraping efforts has yielded a vector

  • Let’s stitch all these vectors together into a rectangle

Code
read_html("https://rvest.tidyverse.org/articles/starwars.html") |>
  html_elements("section") |>
  html_element("h2") |>
  html_text2() -> title
read_html("https://rvest.tidyverse.org/articles/starwars.html") |>
  html_elements("section") |>
  html_element("p") |>
  html_text2() |>
  str_remove("Released: ") |> 
  ymd() -> release_date
read_html("https://rvest.tidyverse.org/articles/starwars.html") |>
  html_elements("section") |>
  html_element(".director") |>
  html_text2() -> director
read_html("https://rvest.tidyverse.org/articles/starwars.html") |>
  html_elements("section") |>
  html_element(".crawl") |>
  html_text2() -> crawl
tibble(title, release_date, director, crawl)
# A tibble: 7 × 4
  title                   release_date director         crawl              
  <chr>                   <date>       <chr>            <chr>              
1 The Phantom Menace      1999-05-19   George Lucas     "Turmoil has engul…
2 Attack of the Clones    2002-05-16   George Lucas     "There is unrest i…
3 Revenge of the Sith     2005-05-19   George Lucas     "War! The Republic…
4 A New Hope              1977-05-25   George Lucas     "It is a period of…
5 The Empire Strikes Back 1980-05-17   Irvin Kershner   "It is a dark time…
6 Return of the Jedi      1983-05-25   Richard Marquand "Luke Skywalker ha…
7 The Force Awakens       2015-12-11   J. J. Abrams     "Luke Skywalker ha…

Case study: Countries of the World

Motivating prompt

I’d like to scrape this data into a data frame where each observation is a country. Columns will include name, capital, population, and area_km_sq.

  • I’ll use developer mode in my web browser to identify the element containing each nation’s information

  • Each nation has its own <div> element with class="col-md-4 country"

  • I’ll use html_elements() to isolate each of these elements

url <- "http://www.scrapethissite.com/pages/simple/"
read_html(url) |>
  html_elements(".col-md-4.country") 
{xml_nodeset (250)}
 [1] <div class="col-md-4 country">\n                        <h3 class=" ...
 [2] <div class="col-md-4 country">\n                        <h3 class=" ...
 [3] <div class="col-md-4 country">\n                        <h3 class=" ...
 [4] <div class="col-md-4 country">\n                        <h3 class=" ...
 [5] <div class="col-md-4 country">\n                        <h3 class=" ...
 [6] <div class="col-md-4 country">\n                        <h3 class=" ...
 [7] <div class="col-md-4 country">\n                        <h3 class=" ...
 [8] <div class="col-md-4 country">\n                        <h3 class=" ...
 [9] <div class="col-md-4 country">\n                        <h3 class=" ...
[10] <div class="col-md-4 country">\n                        <h3 class=" ...
[11] <div class="col-md-4 country">\n                        <h3 class=" ...
[12] <div class="col-md-4 country">\n                        <h3 class=" ...
[13] <div class="col-md-4 country">\n                        <h3 class=" ...
[14] <div class="col-md-4 country">\n                        <h3 class=" ...
[15] <div class="col-md-4 country">\n                        <h3 class=" ...
[16] <div class="col-md-4 country">\n                        <h3 class=" ...
[17] <div class="col-md-4 country">\n                        <h3 class=" ...
[18] <div class="col-md-4 country">\n                        <h3 class=" ...
[19] <div class="col-md-4 country">\n                        <h3 class=" ...
[20] <div class="col-md-4 country">\n                        <h3 class=" ...
...

Tip: Multiple element classes

HTML elements may have multiple classes. This appears as a space in the class= attribute string (e.g., class="col-md-4 country"). You can select these elements using either class but, to avoid ambiguity, it’s a good idea to use both (as done above).
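
As a sketch of the alternatives (all three selectors should match the same 250 <div> elements on this page):

read_html(url) |> html_elements(".country") # one class
read_html(url) |> html_elements("div.country") # element name plus one class
read_html(url) |> html_elements(".col-md-4.country") # both classes (least ambiguous)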

  • Within each <div class="col-md-4 country"> element, country name is stored in the <h3 class="country-name"> element
# two possible ways to scrape this data
## using the element name
read_html(url) |>
  html_elements(".col-md-4.country") |>
  html_element("h3") |>
  html_text2() -> country_names

## using the element attribute
read_html(url) |>
  html_elements(".col-md-4.country") |>
  html_element(".country-name") |>
  html_text2() -> country_names
  • Within each <div class="col-md-4 country"> element, there exists a <div class="country-info"> element which contains the following elements:

    • <span class="country-capital">

    • <span class="country-population">

    • <span class="country-population">

  • These three elements contain the remainder of the country information that we’re looking for

# scraping capitals
read_html(url) |>
  html_elements(".col-md-4.country") |>
  html_element(".country-capital") |>
  html_text2() -> country_capitals

# scraping populations
read_html(url) |>
  html_elements(".col-md-4.country") |>
  html_element(".country-population") |>
  html_text2() |>
  as.numeric() -> country_populations

# scraping areas
read_html(url) |>
  html_elements(".col-md-4.country") |>
  html_element(".country-area") |>
  html_text2() |>
  as.numeric() -> country_areas
  • Finally, we’ll stitch all these vectors together
country_data <- tibble(
  name = country_names,
  capital = country_capitals,
  population = country_populations,
  area = country_areas
)

country_data
# A tibble: 250 × 4
   name                 capital          population     area
   <chr>                <chr>                 <dbl>    <dbl>
 1 Andorra              Andorra la Vella      84000      468
 2 United Arab Emirates Abu Dhabi           4975593    82880
 3 Afghanistan          Kabul              29121286   647500
 4 Antigua and Barbuda  St. John's            86754      443
 5 Anguilla             The Valley            13254      102
 6 Albania              Tirana              2986952    28748
 7 Armenia              Yerevan             2968000    29800
 8 Angola               Luanda             13068161  1246700
 9 Antarctica           None                      0 14000000
10 Argentina            Buenos Aires       41343201  2766890
# ℹ 240 more rows

Scraping HTML tables

  • HTML tables begin with a <table> tag and are composed of <tr> (table row), <th> (table header cell), and <td> (table data cell) elements

  • Wikipedia is an excellent source of HTML tables

  • Wikipedia’s list of sandwiches

  • The html_table() function can scrape this table automatically for us
url <- "https://en.wikipedia.org/wiki/List_of_sandwiches"
read_html(url) |>
  html_element(".wikitable.sortable") |>
  html_table() # can't scrape images, only characters
# A tibble: 226 × 4
   Name                  Image Origin                           Description
   <chr>                 <lgl> <chr>                            <chr>      
 1 American sub          NA    United States                    Traditiona…
 2 Bacon                 NA    United Kingdom                   Often eate…
 3 Bacon, egg and cheese NA    United States                    Breakfast …
 4 Bagel toast           NA    Israel                           Pressed, t…
 5 Baked bean            NA    United States (Boston area)      Canned bak…
 6 Bánh mì[1]            NA    Vietnam                          Filling is…
 7 Barbecue[2][3][4]     NA    United States (Texas, Tennessee… Served on …
 8 Barros Jarpa          NA    Chile                            Ham and ch…
 9 Barros Luco           NA    Chile                            Beef (usua…
10 Bauru                 NA    Brazil                           Melted che…
# ℹ 216 more rows
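
If you want every table on a page rather than one specific table, pass a nodeset of <table> elements; html_table() then returns a list of tibbles, one per table. A sketch:

# all tables on the page, as a list of tibbles
read_html(url) |>
  html_elements("table") |>
  html_table() -> all_tables

length(all_tables) # one tibble per <table> element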

Scraping attributes

  • We’ve been using html_text2() to scrape the text associated with HTML elements (between the start and end tags)

  • Sometimes the information we’re after is found in the attributes of the HTML elements

    • <a href="https://example.com">Site</a>
      • We can scrape the link URL
    • <img src="/image.png" alt="a fat cat">
      • We can scrape the image path, or the alt text
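
The html_attr() function pulls a named attribute out of each element. For example, to grab the href= from every link on the Star Wars Films page (a quick sketch):

read_html("https://rvest.tidyverse.org/articles/starwars.html") |>
  html_elements("a") |>
  html_attr("href") # one URL (or NA) per <a> element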

Motivating prompt

I’d like to scrape the src= URL for each sandwich picture in the table from the previous slide, so that I can download them all to use as phone wallpapers.

url <- "https://en.wikipedia.org/wiki/List_of_sandwiches"
read_html(url) |>
  html_element(".wikitable.sortable") |>
  html_elements("img") |>
  html_attr("src") ->
  sandwich_photo_urls

sandwich_photo_urls[1:3]
[1] "//upload.wikimedia.org/wikipedia/commons/thumb/5/5f/Quiznos_sub_sandwich.jpg/120px-Quiznos_sub_sandwich.jpg"                          
[2] "//upload.wikimedia.org/wikipedia/commons/thumb/3/38/Baconbutty.jpg/120px-Baconbutty.jpg"                                              
[3] "//upload.wikimedia.org/wikipedia/commons/thumb/8/85/Egg_and_cheese_breakfast_sandwich.jpg/120px-Egg_and_cheese_breakfast_sandwich.jpg"

Note: Downloading images

If you actually wanted to download all of these pictures of sandwiches, you could do so using the walk() functional as shown below:

walk(.x = sandwich_photo_urls,
     .f = \(x) download.file(
       url = str_c("https:", x),
       destfile = str_c("sandwich_photos/",
                        str_extract(x, "(?<=[-]).+$")),
       mode = "wb"))

In order for this code to work, you need the character vector of photo URLs, and a folder in your current working directory called “sandwich_photos”
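
If the site’s robots.txt sets a Crawl-delay: directive, you could respect it by pausing between downloads; a sketch assuming a 2-second delay:

walk(.x = sandwich_photo_urls,
     .f = \(x) {
       Sys.sleep(2) # assumed delay; match the site's Crawl-delay
       download.file(
         url = str_c("https:", x),
         destfile = str_c("sandwich_photos/",
                          str_extract(x, "(?<=[-]).+$")),
         mode = "wb")
     })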