Data Wrangling with SQL

Sam Mason

Learning goals

Following this lecture, you should be able to:

  • Write basic SQL queries to transformation data stored in a relational SQL database

Packages

# we don't need any R packages during this lecture

The story so far

All of our relational database work thus far has been in small, flat-file databases (FFDBs)

  • Small datasets stored locally

  • Application specific (.rds) or common file formats (.csv)

  • Tables stored on disk and pulled into memory

    • library(nycflights13)

    • read_csv()

Limitations of flat-file DBs

  • Computer memory

    • Most modern personal computers have ~ 8GB of RAM

    • We run into issues when our FFDB won’t fit into memory

  • Database integrity

    • FFDBs have no idea that they’re relational databases

    • Updates to specific files can inadvertently break relations

  • Database access

    • FFDBs are inherently local — collaboration is difficult
  • Database security

    • Shared FFDBs have limited security options

Relational database management systems

Designed for building and querying databases in SQL

Structured query language

SQL is currently the industry standard for interfacing with relational databases

  • Developed about 50 years ago

  • Built on relational algebra and tuple (relational) calculus

  • Assumes rectangular data (organized into rows and columns)

There also exist “NoSQL” database management systems

  • MongoDB, for example, is often used to store JSON objects

Connecting to our MySQL server

  1. Open a terminal instance on your computer
  2. Run ssh dsc210@10.100.50.88
  3. Enter password “tidyverse”
  4. Run mysql -u dsc210 -p
  5. Enter password “tidyverse”
  6. Run USE lahman;

Note: accessing the SQL server

You can only access the SQL server when on Gordon’s network.

The lahman database

  • Maintained by Sean Lahman, journalist and baseball enthusiast

  • Our copy of the database is made up of more than two dozen tables (1871 through 2019)

    • people: all professional baseball players

    • batting: batting statistics (e.g., at bats, hits, home runs)

    • pitching: pitching statistics (e.g., pitch count, strikes, ERA)

    • fielding: fielding statistics (e.g., errors, double plays)

Important: observation definitions

In the lahman database, the definition of each observations changes depending on the table.

  • people: each observation is a professional baseball player

  • batting, pitching, fielding: each observation is a particular “stint” in a particular year for a particular player

SQL tools for exploring a database

  • SHOW TABLES; will list all datasets in the lahman database

  • DESCRIBE people; will give you a breakdown of people, indicating key columns, NULL columns, etc.

  • You’ll often have a schema to help you, but not always (may God have mercy on your soul)

SELECT…FROM queries

Every SQL query requires the two keywords (functions) SELECT and FROM

SQL

SELECT nameFirst, nameLast
FROM people;

{dplyr}

people |>
  select(nameFirst, nameLast)

SQL order of evaluation

  1. FROM people
  2. SELECT nameFirst, nameLast

Tip: selecting all columns

You can select all columns from a given table using *. For example, SELECT * FROM people will return the entire people table.

Filtering with WHERE

SQL

SELECT nameFirst, nameLast
FROM people
WHERE birthYear > 1993;

{dplyr}

people |>
  select(nameFirst, nameLast) |>
  filter(birthYear > 1993)

Note: ending SQL queries

Use ; to indicate the end of a SQL query.

SQL order of evaluation

  1. FROM people
  2. WHERE birthYear > 1993
  3. SELECT nameFirst, nameLast

Tip: using Boolean and logical operators

The WHERE keyword supports the following operators:

  • equal to =

  • not equal to <> or !=

  • greater than >; greater than or equal to >=

  • less than <; less than or equal to <=

  • AND; OR

In-class exercises

Using the server connection you made earlier, write SQL queries to complete all exercises. Don’t forget to end your query with ;.

Tip: string literals in MySQL

In MySQL you can create string literals using either single or double quotes, but single quotes are more commonly used.

  1. Find the names of all players shorter than 6 ft (72 inches; height is recorded in inches)
  2. Find all records of pitchers winning at least 20 games in a stint (the W column records wins in a single stint)
  3. Find all records of batters hitting more than 45 home runs (HR) and at least 100 RBIs in a single stint
  4. Find all records of Red Sox ('BOS') fielders not making an error (E) in a stint
  5. Find all baseball parks in California, Florida, or Texas
    • Use SELECT DISTINCT state FROM parks; to see how state data is stored

    • DISTINCT works just like dplyr::distinct(), giving you only the unique values

# exercise 1
SELECT nameFirst, nameLast, height
FROM people
WHERE height < 72;

# exercise 2
SELECT playerID, W
FROM pitching
WHERE W >= 20;

# exercise 3
SELECT playerID, HR, RBI
FROM batting
WHERE HR > 45 AND RBI >= 100;

# exercise 4
SELECT playerID, teamID, E
FROM fielding
WHERE teamID = 'BOS' AND E = 0;

# exercise 5
SELECT parkname, state
FROM parks
WHERE state = 'CA' OR state = 'FL' OR state = 'TX';

Filtering ranges and sets

  • Filtering within a range (inclusive) of values with BETWEEN

SQL

SELECT nameFirst, nameLast, birthYear
FROM people
WHERE birthYear BETWEEN 1900 AND 2000
LIMIT 10;

{dplyr}

people |>
  select(nameFirst, nameLast, birthYear) |>
  filter(between(birthYear, 1900, 2000)) |>
  head(10)
  • Filtering within sets of possible values (multiple OR) with IN

SQL

SELECT nameFirst, nameLast, birthState
FROM people
WHERE birthState IN ('ME', 'NH', 'VT', 'MA', 'CT', 'RI')
LIMIT 10;

{dplyr}

people |>
  select(nameFirst, nameLast, birthState) |>
  filter(birthState %in% c("ME", "NH", "VT", "MA", "CT", "RI")) |>
  head(10)

Basic filtering with strings

  • We can filter for observations containing particular string patterns using SQL’s % and _ wilcards

    • % represents 0 or more characters

    • _ represents a single character

  • Wildcards are used in conjunction with the LIKE keyword

  • Let’s find all players with a last name that starts with “A”

SQL

SELECT nameLast
FROM people
WHERE nameLast LIKE 'A%'
LIMIT 10;

{dplyr}

people |>
  select(nameLast) |>
  filter(str_detect(nameLast, "^A")) |>
  head(10)

Tip: limiting results

When your query returns many rows, you may want to use the LIMIT keyword to look at just the first several.

  • We could also return all players from towns ending in “ville”

SQL

SELECT nameLast
FROM people
WHERE nameLast LIKE '%ville'
LIMIT 10;

{dplyr}

people |>
  select(nameLast) |>
  filter(str_detect(nameLast, "ville$")) |>
  head(10)
  • Or find all players with a four letter first name ending in “ill” (e.g., “Bill”, “Will”, etc.)

SQL

SELECT nameFirst
FROM people
WHERE nameFirst LIKE '_ill'
LIMIT 10;

{dplyr}

people |>
  select(nameFirst) |>
  filter(str_detect(nameFirst, "^.ill$")) |>
  head(10)

In-class exercises

Using the server connection you made earlier, write SQL queries to complete all exercises. Don’t forget to end your query with ;.

  1. Find all baseball stadiums (parks) on the West Coast (California, Oregon, Washington)
  2. Find all pitchers active during the 20th century (use BETWEEN) who had more than 20 losses
  3. Find all unique (SELECT DISTINCT) player last names that end in “son”
  4. Find all stadiums with “Ball” or “ball” anywhere in the name
  5. Find all unique player first names where the second to last letter is “e”
# exercise 6
SELECT parkname, state
FROM parks
WHERE state IN ('WA', 'OR', 'CA');

# exercise 7
SELECT playerID, year, L
FROM pitching
WHERE yearID BETWEEN 1900 AND 2000 AND L > 20;

# exercise 8
SELECT DISTINCT nameLast
FROM people
WHERE nameLast LIKE '%son';

# exercise 9
SELECT parkname
FROM parks
WHERE parkname LIKE '%Ball%' OR parkname LIKE '%ball%';

# exercise 10
SELECT DISTINCT nameFirst
FROM people
WHERE nameFirst LIKE '%e_';

Sorting rows with ORDER BY

  • Just like dplyr::arrange(), SQL’s ORDER BY keyword is used to sort rows in either ascending or descending order

SQL

SELECT salary, yearID
FROM salaries
WHERE salary > 20000000
ORDER BY yearID
LIMIT 10;

{dplyr}

salaries |>
  select(salary, yearID) |>
  filter(salaries > 20000000) |>
  arrange(yearID) |>
  head(10)

SQL order of evaluation

  1. FROM salaries
  2. WHERE salary > 20000000
  3. SELECT salary, yearID
  4. ORDER BY yearID
  5. LIMIT 10

Tip: descending order

The DESC keyword can be used in conjunction with ORDER BY to sort rows in descending order. For example, ORDER BY yearID DESC.

Summarizing fields

  • Unlike {dplyr}, SQL doesn’t have a dedicated SUMMARIZE keyword

  • We can summarize by calling functions directly on column names in the SELECT line

SQL

SELECT AVG(H)
FROM batting
WHERE AB > 200;

{dplyr}

batting |>
  filter(AB > 200) |>
  summarize(`AVG(H)` = mean(H, na.rm = T))

Note: summarize order of operations

The summary functions (AVG(), SUM(), MIN(), COUNT(), etc.) are executed near the end of the order of evaluation along with the SELECT keyword.

Aliasing columns with AS

  • Let’s calculate a bunch of career statistics for Hank Aaron, one of the greatest batters in Major League Baseball history

SQL

SELECT SUM(H), SUM(HR), SUM(RBI)
FROM batting
WHERE playerID = 'aaronha01';

{dplyr}

batting |>
  filter(playerID == "aaronha01") |>
  summarize(`SUM(H)` = sum(H, na.rm = TRUE),
            `SUM(HR)` = sum(HR, na.rm = TRUE),
            `SUM(RBI)` = sum(RBI, na.rm = TRUE))
  • We can make the summarized column names more descriptive by using AS to create column aliases

SQL

SELECT SUM(H) AS career_hits,
  SUM(HR) AS career_hrs,
  SUM(RBI) AS career_rbis
FROM batting
WHERE playerID = 'aaronha01';

{dplyr}

batting |>
  filter(playerID == "aaronha01") |>
  summarize(career_hits = sum(H, na.rm = TRUE),
            career_hrs = sum(HR, na.rm = TRUE),
            career_rbis = sum(RBI, na.rm = TRUE))

Grouped summaries with GROUP BY

  • These career statistics are impressive, but Hank Aaron played for more than two decades. What do these batting numbers look like on an per year basis?

SQL

SELECT yearID AS year,
  AVG(H) AS avg_hits_per_year,
  AVG(HR) AS avg_hrs_per_year,
  AVG(RBI) AS avg_rbis_per_year,
FROM batting
WHERE playerID = 'aaronha01'
GROUP BY yearID

{dplyr}

batting |>
  filter(playerID == "aaronha01") |>
  group_by(yearID) |>
  summarize(avg_hits_per_year = sum(H, na.rm = TRUE),
            avg_hrs_per_year = sum(HR, na.rm = TRUE),
            avg_rbis_per_year = sum(RBI, na.rm = TRUE))

SQL order of evaluation

  1. FROM batting
  2. WHERE playerID = 'aaronha01'
  3. GROUP BY yearID
  4. SELECT...

In-class exercises

Using the server connection you made earlier, write SQL queries to complete all exercises. Don’t forget to end your query with ;.

  1. In the last decade (ending in 2019), which pitcher holds the record for most strikeouts (SO) in a single stint?

  2. How many players bat and throw with opposite hands? Use COUNT(*) as a summary function to give the total number of rows in the filtered dataset.

  3. Which position (POS; catcher, first baseman, outfielder, etc.) had the highest number of fielding errors (E) in 2019?

  4. Between 1980 and 2009 (inclusive), which team has the highest win-loss differential (total wins minus total losses over that 30 year span). SQL supports arithmetic operators in the SELECT clause to summarize.

# exercise 11
SELECT playerID, SUM(SO) as total_so
FROM pitching
WHERE yearID BETWEEN 2010 AND 2019
GROUP BY playerID
ORDER BY total_so DESC
LIMIT 1;

# exercise 12
SELECT COUNT(*) AS ambidextrous
FROM people
WHERE (bats = 'L' AND throws = 'R') OR (bats = 'R' AND throws = 'L');

# exercise 13
SELECT POS AS position, SUM(E) AS total_errors
FROM fielding
WHERE yearID = 2019
GROUP BY POS
ORDER BY total_errors DESC;

# exercise 14
SELECT teamID, (SUM(W) - SUM(L)) AS differential
FROM teams
WHERE yearID BETWEEN 1980 AND 2009
GROUP BY teamID
ORDER BY differential DESC;

Joining in SQL

  • Let’s join together Babe Ruth’s batting and pitching stats to return all observations where he logged time doing both

SQL

SELECT *
FROM batting AS b
JOIN pitching AS p
ON b.playerID = p.playerID
  AND b.yearID = p.yearID
  AND b.stint = p.stint
WHERE b.playerID = 'ruthba01'

{dplyr}

batting |>
  filter(playerID = 'ruthba01') |>
  inner_join(
    y = pitching,
    by = join_by(playerID = playerID,
                 yearID = yearID,
                 stint = stint))

SQL order of evaluation

  1. FROM people
  2. WHERE people.playerID = 'scherma01'
  3. JOIN pitching ON people.playerID = pitching.playerID
  4. SELECT DISTINCT teamID

Note: optimized order of evaluation

The orders of evaluation that we’ve been looking at throughout this lecture are conceptual. In reality, the SQL software has the freedom to change the order of clauses (lines) in ways that optimize the retrieval of data. In this example, it is likely most efficient to filter people first so that the JOIN has fewer rows to operate on, but there is no guarantee that WHERE will always be evaluated before JOIN in all queries that you will write.

  • Do players with higher salaries tend to get more awards?

SQL

SELECT s.playerID, AVG(salary) AS avg_salary, COUNT(awardID) AS awards
FROM salaries AS s
JOIN awardsplayers AS a
ON s.playerID = a.playerID
GROUP BY playerID
ORDER BY avg_salary DESC
LIMIT 20;

{dplyr}

salaries |>
  inner_join(y = awardsplayers, by = "playerID") |>
  group_by(playerID) |>
  summarize(avg_salary = mean(salary, na.rm = TRUE),
            awards = sum(!is.na(awardID))) |>
  arrange(desc(salary)) |>
  head(20)

Note: other types of joins

MySQL supports all the same mutating joins that {dplyr} does, including LEFT JOIN and RIGHT JOIN.