Following this lecture, students should be able to:
Describe the ways in which we import (bring in) data
From files (local or web)
From direct database connection (coming soon)
Using web APIs
Web scraping (when all else fails)
Note: Import methods not ranked
Files, databases, and web APIs are all (more or less) equally good options — these three methods are not listed by order of importance or validity.
Is web scraping legal? It is not inherently illegal
In general, you’re safe if you follow these guidelines:
Scrape only publicly available data (i.e., does not require an account or subscription to access)
Scrape only facts — no original, creative works to avoid breaching copyright
Do not scrape personal information (e.g., names, dates of birth, gender, etc.)
Do not scrape for monetary gain
Respect the wishes of the organization (terms of service)
Respect the wishes of the web developer (robots.txt)
Note: The robots.txt file
The robots.txt file provides instructions (written by the web developer) that outline where web scraping is not allowed to take place on a given website. It may also set limits on the frequency with which the scraping program can query the website servers (in the context of iterative scraping). Compliance with robots.txt is entirely optional.
Note: Crawling vs. scraping
robots.txt files are created with web “crawling” programs (called “spiders”) in mind. These programs, used commonly by search engines like Google, Baidu (China), Bing, etc., index the internet by traveling from website to website using hyperlinks, sometimes scraping data along the way. In contrast, a web scraping program is generally designed to scrape specific values from a specific webpage or set of webpages. Despite being generally narrower in scope, it remains best practice for web scraping activities to respect the directives outlined in robots.txt.
robots.txt files are composed of only a few standard “directives”
User-agent:
Used to identify (“ID”) which scraping programs the rules that follow apply to
Disallow:
Indicates a URL path that is off-limits to scraping
Allow:
Indicates a URL path is okay to scrape
Note: Non-standard directives
The three most common directives (listed above) give you information about where you can and cannot scrape. There are also a series of directives that relate to how you scrape, though these are less common. The most relevant “non-standard” directive (for us) is Crawl-delay:, often interpreted as the number of seconds to wait between website server requests.
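For illustration, a robots.txt file combining standard and non-standard directives might look like the sketch below (the /private/ path and the ten-second delay are invented values, not taken from a real site):

User-agent: *
Disallow: /private/
Crawl-delay: 10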
Directives | Interpretation |
---|---|
`User-agent: *`<br>`Disallow: /` | The * means “all scraping programs” and the / means “the whole website” |
`User-agent: *`<br>`Disallow: /secrets/` | “All scraping programs” are not allowed to scrape the “secrets” page of my website |
`User-agent: BadBot`<br>`Allow:` | The scraping program “BadBot” is allowed to scrape “nothing” |
The robots.txt file is always stored at the root of the website
The list below provides a set of illustrative examples
https://www.gordon.edu/robots.txt (crawl delay)
https://www.reddit.com/robots.txt (no scraping allowed)
https://en.wikipedia.org/robots.txt (misbehaving agents)
https://www.linkedin.com/robots.txt (extensive directives)
Note: Scraping static HTML
The process below relates to static HTML web scraping. Scraping dynamic webpages (i.e., those that use javascript to render data “on the fly”) is not covered in this course.
Read the raw HTML code into R
Identify elements in the HTML code that are linked to target data
Extract target data from element text and attributes
Transform (e.g., rectangle, tidy, etc.) the data as needed
We’ll be using rvest::read_html()
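The code in the rest of this lecture assumes a few packages have been attached; a minimal setup might look like this (the exact set of library() calls is an assumption, not shown in the original slides):

library(rvest)     # read_html(), html_elements(), html_text2(), html_attr(), html_table()
library(tidyverse) # str_remove(), tibble(), walk()
library(lubridate) # ymd()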
Let’s read in the Star Wars Films webpage from the {rvest} package website
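The html_document output printed below comes from a call along these lines (the same URL is used throughout this lecture):

read_html("https://rvest.tidyverse.org/articles/starwars.html")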
{html_document}
<html lang="en">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
[2] <body>\n <a href="#container" class="visually-hidden-focusable">S ...
All HTML webpages are composed of two major elements
<head>, which contains metadata
<body>, which contains the webpage contents
HTML is a markup language (that’s the “ML” part of the name)
Plain text is formatted and organized using character strings with special meaning
We call these special character strings elements
Most elements are made up of one start tag and one end tag (a few, like <img>, have no end tag)
The start tag can optionally contain attributes that modify how the element is rendered
Note: HTML is hierarchical
Notice that HTML elements are organized hierarchically like JSON objects and arrays. Identifying the patterns in this hierarchy is the key to web scraping. We tend to talk about nested elements using familial terms: <h1> is a child of <body>, which is a child of <html>.
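To make the element/attribute/text vocabulary concrete, here is a small self-contained sketch using rvest::minimal_html(); the fragment and its contents are invented for illustration:

# a tiny, invented HTML fragment: <a> is an element (a child of <p>),
# href= is one of its attributes, and "rvest" is its text
page <- minimal_html('<p>Visit the <a href="https://rvest.tidyverse.org">rvest</a> website.</p>')

page |> html_element("a") |> html_text2()      # "rvest"
page |> html_element("a") |> html_attr("href") # "https://rvest.tidyverse.org"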
Common HTML elements
Structural elements
<div>, <section> (organizational)
<h1>, <h2>, <h3>, etc. (headers)
<p> (paragraph), <ol> (ordered list)
Inline elements
<b> (bold), <i> (italics), <a> (hyperlink)
Media elements
<img> (image), <video> (video)
Begin by conceptualizing the data frame you want
Each film will get its own row
Columns for title, release date, director, and crawl text
Tip: Using your web browser to explore HTML
Your web browser includes a “developer” mode that will allow you to inspect the HTML code used to build the webpage you’re viewing. A quick Google search will help you figure out how to enable developer mode in your web browser.
Each film’s information is contained in its own <section> element
Motivating prompt
I’d like to scrape all of the film titles into an R vector.
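The xml_nodeset printed below comes from piping the page into html_elements(); the same pipeline reappears on the next slide:

read_html("https://rvest.tidyverse.org/articles/starwars.html") |>
  html_elements("section")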
{xml_nodeset (7)}
[1] <section><h2 data-id="1">\nThe Phantom Menace\n</h2>\n<p>\nReleased: ...
[2] <section><h2 data-id="2">\nAttack of the Clones\n</h2>\n<p>\nRelease ...
[3] <section><h2 data-id="3">\nRevenge of the Sith\n</h2>\n<p>\nReleased ...
[4] <section><h2 data-id="4">\nA New Hope\n</h2>\n<p>\nReleased: 1977-05 ...
[5] <section><h2 data-id="5">\nThe Empire Strikes Back\n</h2>\n<p>\nRele ...
[6] <section><h2 data-id="6">\nReturn of the Jedi\n</h2>\n<p>\nReleased: ...
[7] <section><h2 data-id="7">\nThe Force Awakens\n</h2>\n<p>\nReleased: ...
The html_elements() function creates a special list (called an XML nodeset) of all <section> elements

read_html("https://rvest.tidyverse.org/articles/starwars.html") |>
  html_elements("section") |>
  html_element("h2")
{xml_nodeset (7)}
[1] <h2 data-id="1">\nThe Phantom Menace\n</h2>
[2] <h2 data-id="2">\nAttack of the Clones\n</h2>
[3] <h2 data-id="3">\nRevenge of the Sith\n</h2>
[4] <h2 data-id="4">\nA New Hope\n</h2>
[5] <h2 data-id="5">\nThe Empire Strikes Back\n</h2>
[6] <h2 data-id="6">\nReturn of the Jedi\n</h2>
[7] <h2 data-id="7">\nThe Force Awakens\n</h2>
The html_element() function operates on each element of the list returned by html_elements()
Unlike html_elements(), only a single HTML element is returned by html_element()
read_html("https://rvest.tidyverse.org/articles/starwars.html") |>
html_elements("section") |>
html_element("h2") |>
html_text2()
[1] "The Phantom Menace" "Attack of the Clones"
[3] "Revenge of the Sith" "A New Hope"
[5] "The Empire Strikes Back" "Return of the Jedi"
[7] "The Force Awakens"
The html_text2() function pulls the plain text out of each <h2> element

# exercise 1
url <- "https://rvest.tidyverse.org/articles/starwars.html"
read_html(url) |>
html_elements("section") |> # a list of <section> elements
html_element("p") |> # pulls out the first <p> element
html_text2() |> # pulls out the text from each <p> element
str_remove("Released: ") -> # removes "Released: "
sw_film_date
# exercise 2
sw_film_date <- ymd(sw_film_date)
Motivating prompt
I’d like to scrape all director data into an R vector.
HTML webpages are often accompanied by CSS files
The HTML provides the content and structure of the webpage
The CSS provides the ✨aesthetics✨
HTML elements are styled based on their id= and class= attributes
We can use these attributes to be more precise in our element selection
read_html("https://rvest.tidyverse.org/articles/starwars.html") |>
html_elements("section") |>
html_element(".director") |> # all elements with class=director
html_text2()
[1] "George Lucas" "George Lucas" "George Lucas"
[4] "George Lucas" "Irvin Kershner" "Richard Marquand"
[7] "J. J. Abrams"
Here the "." in ".director" indicates class=
If we wanted to find all elements where id=director (no such elements actually exist in this HTML), we’d write html_element("#director")
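As an invented illustration of the "#" (id=) selector syntax, a minimal sketch using rvest::minimal_html() (this fragment is hypothetical and does not exist on the Star Wars page):

# hypothetical fragment, just to show selecting by id= with "#"
minimal_html('<h2 id="director">Some Director</h2>') |>
  html_element("#director") |>
  html_text2()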
Each one of our scraping efforts has yielded a vector
Let’s stitch all these vectors together into a rectangle
read_html("https://rvest.tidyverse.org/articles/starwars.html") |>
html_elements("section") |>
html_element("h2") |>
html_text2() -> title
read_html("https://rvest.tidyverse.org/articles/starwars.html") |>
html_elements("section") |>
html_element("p") |>
html_text2() |>
str_remove("Released: ") |>
ymd() -> release_date
read_html("https://rvest.tidyverse.org/articles/starwars.html") |>
html_elements("section") |>
html_element(".director") |>
html_text2() -> director
read_html("https://rvest.tidyverse.org/articles/starwars.html") |>
html_elements("section") |>
html_element(".crawl") |>
html_text2() -> crawl
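These four vectors can be combined with tibble() to produce the data frame printed below (the object name sw_films is my own choice, not from the original slides):

# stitch the four vectors into one rectangle
sw_films <- tibble(
  title = title,
  release_date = release_date,
  director = director,
  crawl = crawl
)
sw_films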
# A tibble: 7 × 4
title release_date director crawl
<chr> <date> <chr> <chr>
1 The Phantom Menace 1999-05-19 George Lucas "Turmoil has engul…
2 Attack of the Clones 2002-05-16 George Lucas "There is unrest i…
3 Revenge of the Sith 2005-05-19 George Lucas "War! The Republic…
4 A New Hope 1977-05-25 George Lucas "It is a period of…
5 The Empire Strikes Back 1980-05-17 Irvin Kershner "It is a dark time…
6 Return of the Jedi 1983-05-25 Richard Marquand "Luke Skywalker ha…
7 The Force Awakens 2015-12-11 J. J. Abrams "Luke Skywalker ha…
Motivating prompt
I’d like to scrape this data into a data frame where each observation is a country. Columns will include name, capital, population, and area_km_sq.
I’ll use developer mode in my web browser to identify the element containing each nation’s information
Each nation has its own <div> element with class="col-md-4 country"
I’ll use html_elements() to isolate each of these elements
url <- "http://www.scrapethissite.com/pages/simple/"
read_html(url) |>
html_elements(".col-md-4.country")
{xml_nodeset (250)}
[1] <div class="col-md-4 country">\n <h3 class=" ...
[2] <div class="col-md-4 country">\n <h3 class=" ...
[3] <div class="col-md-4 country">\n <h3 class=" ...
[4] <div class="col-md-4 country">\n <h3 class=" ...
[5] <div class="col-md-4 country">\n <h3 class=" ...
[6] <div class="col-md-4 country">\n <h3 class=" ...
[7] <div class="col-md-4 country">\n <h3 class=" ...
[8] <div class="col-md-4 country">\n <h3 class=" ...
[9] <div class="col-md-4 country">\n <h3 class=" ...
[10] <div class="col-md-4 country">\n <h3 class=" ...
[11] <div class="col-md-4 country">\n <h3 class=" ...
[12] <div class="col-md-4 country">\n <h3 class=" ...
[13] <div class="col-md-4 country">\n <h3 class=" ...
[14] <div class="col-md-4 country">\n <h3 class=" ...
[15] <div class="col-md-4 country">\n <h3 class=" ...
[16] <div class="col-md-4 country">\n <h3 class=" ...
[17] <div class="col-md-4 country">\n <h3 class=" ...
[18] <div class="col-md-4 country">\n <h3 class=" ...
[19] <div class="col-md-4 country">\n <h3 class=" ...
[20] <div class="col-md-4 country">\n <h3 class=" ...
...
Tip: Multiple element classes
HTML elements may have multiple classes. This appears as a space in the class= attribute string (e.g., class="col-md-4 country"). You can select these elements using either class but, to avoid ambiguity, it’s a good idea to use both by chaining them with dots (as in ".col-md-4.country" above).
<div class="col-md-4 country">
element, country name is stored in the <h3 class="country-name">
element# two possible ways to scrape this data
## using the element name
read_html(url) |>
  html_elements(".col-md-4.country") |>
  html_element("h3") |>
  html_text2() -> country_names

## using the element attribute
read_html(url) |>
  html_elements(".col-md-4.country") |>
  html_element(".country-name") |>
  html_text2() -> country_names
Within each <div class="col-md-4 country">
element, there exists a <div class="country-info">
element which contains the following elements:
<span class="country-capital">
<span class="country-population">
<span class="country-population">
These three elements contain the remainder of the country information that we’re looking for
# scraping capitals
read_html(url) |>
  html_elements(".col-md-4.country") |>
  html_element(".country-capital") |>
  html_text2() -> country_capitals

# scraping populations
read_html(url) |>
  html_elements(".col-md-4.country") |>
  html_element(".country-population") |>
  html_text2() |>
  as.numeric() -> country_populations

# scraping areas
read_html(url) |>
  html_elements(".col-md-4.country") |>
  html_element(".country-area") |>
  html_text2() |>
  as.numeric() -> country_areas
country_data <- tibble(
  name = country_names,
  capital = country_capitals,
  population = country_populations,
  area = country_areas
)
country_data
# A tibble: 250 × 4
name capital population area
<chr> <chr> <dbl> <dbl>
1 Andorra Andorra la Vella 84000 468
2 United Arab Emirates Abu Dhabi 4975593 82880
3 Afghanistan Kabul 29121286 647500
4 Antigua and Barbuda St. John's 86754 443
5 Anguilla The Valley 13254 102
6 Albania Tirana 2986952 28748
7 Armenia Yerevan 2968000 29800
8 Angola Luanda 13068161 1246700
9 Antarctica None 0 14000000
10 Argentina Buenos Aires 41343201 2766890
# ℹ 240 more rows
HTML tables are initiated by <table> and composed of <th> (table heading), <tr> (table row), and <td> (table data) elements
Wikipedia is an excellent source of HTML tables
The html_table() function can scrape this table automatically for us

url <- "https://en.wikipedia.org/wiki/List_of_sandwiches"
read_html(url) |>
  html_element(".wikitable.sortable") |>
  html_table() # can't scrape images, only characters
# A tibble: 226 × 4
Name Image Origin Description
<chr> <lgl> <chr> <chr>
1 American sub NA United States Traditiona…
2 Bacon NA United Kingdom Often eate…
3 Bacon, egg and cheese NA United States Breakfast …
4 Bagel toast NA Israel Pressed, t…
5 Baked bean NA United States (Boston area) Canned bak…
6 Bánh mì[1] NA Vietnam Filling is…
7 Barbecue[2][3][4] NA United States (Texas, Tennessee… Served on …
8 Barros Jarpa NA Chile Ham and ch…
9 Barros Luco NA Chile Beef (usua…
10 Bauru NA Brazil Melted che…
# ℹ 216 more rows
We’ve been using html_text2() to scrape the text associated with HTML elements (between the start and end tags)
Sometimes the information we’re after is found in the attributes of the HTML elements
<a href="https://example.com">Site</a>
<img src="/image.png" alt="a fat cat">
Motivating prompt
I’d like to scrape the src= URL for each sandwich picture in the table from the previous slide, so that I can download them all to use as phone wallpapers.
url <- "https://en.wikipedia.org/wiki/List_of_sandwiches"
read_html(url) |>
  html_element(".wikitable.sortable") |>
  html_elements("img") |>
  html_attr("src") ->
  sandwich_photo_urls
sandwich_photo_urls[1:3]
[1] "//upload.wikimedia.org/wikipedia/commons/thumb/5/5f/Quiznos_sub_sandwich.jpg/120px-Quiznos_sub_sandwich.jpg"
[2] "//upload.wikimedia.org/wikipedia/commons/thumb/3/38/Baconbutty.jpg/120px-Baconbutty.jpg"
[3] "//upload.wikimedia.org/wikipedia/commons/thumb/8/85/Egg_and_cheese_breakfast_sandwich.jpg/120px-Egg_and_cheese_breakfast_sandwich.jpg"
Note: Downloading images
If you actually wanted to download all of these pictures of sandwiches, you could do so using the walk() functional as shown below:
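A sketch of that walk() call follows; the use of download.file(), paste0() to prepend the https: scheme, and basename() for the output file names are my assumptions, not the original code:

sandwich_photo_urls |>
  walk(\(u) download.file(
    url = paste0("https:", u),                            # URLs begin with "//", so add a scheme
    destfile = file.path("sandwich_photos", basename(u)), # save into the sandwich_photos folder
    mode = "wb",                                          # binary mode keeps image files intact
    quiet = TRUE
  ))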
In order for this code to work, you need the character vector of photo URLs, and a folder in your current working directory called “sandwich_photos”
DSC 210 Data Wrangling