Following this lecture, you should be able to:
All of our relational database work thus far has been in small, flat-file databases (FFDBs)
Small datasets stored locally
Application specific (.rds) or common file formats (.csv)
Tables stored on disk and pulled into memory
library(nycflights13)
read_csv()
Computer memory
Most modern personal computers have ~ 8GB of RAM
We run into issues when our FFDB won’t fit into memory
Database integrity
FFDBs have no idea that they’re relational databases
Updates to specific files can inadvertently break relations
Database access
Database security



Designed for building and querying databases in SQL
SQL is currently the industry standard for interfacing with relational databases
Developed about 50 years ago
Built on relational algebra and tuple (relational) calculus
Assumes rectangular data (organized into rows and columns)
There also exist “NoSQL” database management systems
ssh dsc210@10.100.50.88mysql -u dsc210 -pUSE lahman;Note: accessing the SQL server
You can only access the SQL server when on Gordon’s network.
lahman databaseMaintained by Sean Lahman, journalist and baseball enthusiast
Our copy of the database is made up of more than two dozen tables (1871 through 2019)
people: all professional baseball players
batting: batting statistics (e.g., at bats, hits, home runs)
pitching: pitching statistics (e.g., pitch count, strikes, ERA)
fielding: fielding statistics (e.g., errors, double plays)
Important: observation definitions
In the lahman database, the definition of each observations changes depending on the table.
people: each observation is a professional baseball player
batting, pitching, fielding: each observation is a particular “stint” in a particular year for a particular player
SHOW TABLES; will list all datasets in the lahman database
DESCRIBE people; will give you a breakdown of people, indicating key columns, NULL columns, etc.
You’ll often have a schema to help you, but not always (may God have mercy on your soul)
SELECT…FROM queriesEvery SQL query requires the two keywords (functions) SELECT and FROM
SQL order of evaluation
FROM peopleSELECT nameFirst, nameLastTip: selecting all columns
You can select all columns from a given table using *. For example, SELECT * FROM people will return the entire people table.
WHERENote: ending SQL queries
Use ; to indicate the end of a SQL query.
SQL order of evaluation
FROM peopleWHERE birthYear > 1993SELECT nameFirst, nameLastTip: using Boolean and logical operators
The WHERE keyword supports the following operators:
equal to =
not equal to <> or !=
greater than >; greater than or equal to >=
less than <; less than or equal to <=
AND; OR
Using the server connection you made earlier, write SQL queries to complete all exercises. Don’t forget to end your query with ;.
Tip: string literals in MySQL
In MySQL you can create string literals using either single or double quotes, but single quotes are more commonly used.
height is recorded in inches)W column records wins in a single stint)HR) and at least 100 RBIs in a single stint'BOS') fielders not making an error (E) in a stintUse SELECT DISTINCT state FROM parks; to see how state data is stored
DISTINCT works just like dplyr::distinct(), giving you only the unique values
# exercise 1
SELECT nameFirst, nameLast, height
FROM people
WHERE height < 72;
# exercise 2
SELECT playerID, W
FROM pitching
WHERE W >= 20;
# exercise 3
SELECT playerID, HR, RBI
FROM batting
WHERE HR > 45 AND RBI >= 100;
# exercise 4
SELECT playerID, teamID, E
FROM fielding
WHERE teamID = 'BOS' AND E = 0;
# exercise 5
SELECT parkname, state
FROM parks
WHERE state = 'CA' OR state = 'FL' OR state = 'TX';BETWEENSQL
OR) with INWe can filter for observations containing particular string patterns using SQL’s % and _ wilcards
% represents 0 or more characters
_ represents a single character
Wildcards are used in conjunction with the LIKE keyword
Let’s find all players with a last name that starts with “A”
Tip: limiting results
When your query returns many rows, you may want to use the LIMIT keyword to look at just the first several.
Using the server connection you made earlier, write SQL queries to complete all exercises. Don’t forget to end your query with ;.
BETWEEN) who had more than 20 lossesSELECT DISTINCT) player last names that end in “son”# exercise 6
SELECT parkname, state
FROM parks
WHERE state IN ('WA', 'OR', 'CA');
# exercise 7
SELECT playerID, year, L
FROM pitching
WHERE yearID BETWEEN 1900 AND 2000 AND L > 20;
# exercise 8
SELECT DISTINCT nameLast
FROM people
WHERE nameLast LIKE '%son';
# exercise 9
SELECT parkname
FROM parks
WHERE parkname LIKE '%Ball%' OR parkname LIKE '%ball%';
# exercise 10
SELECT DISTINCT nameFirst
FROM people
WHERE nameFirst LIKE '%e_';ORDER BYdplyr::arrange(), SQL’s ORDER BY keyword is used to sort rows in either ascending or descending orderSQL order of evaluation
FROM salariesWHERE salary > 20000000SELECT salary, yearIDORDER BY yearIDLIMIT 10Tip: descending order
The DESC keyword can be used in conjunction with ORDER BY to sort rows in descending order. For example, ORDER BY yearID DESC.
Unlike {dplyr}, SQL doesn’t have a dedicated SUMMARIZE keyword
We can summarize by calling functions directly on column names in the SELECT line
Note: summarize order of operations
The summary functions (AVG(), SUM(), MIN(), COUNT(), etc.) are executed near the end of the order of evaluation along with the SELECT keyword.
ASAS to create column aliasesGROUP BYSQL
SQL order of evaluation
FROM battingWHERE playerID = 'aaronha01'GROUP BY yearIDSELECT...Using the server connection you made earlier, write SQL queries to complete all exercises. Don’t forget to end your query with ;.
In the last decade (ending in 2019), which pitcher holds the record for most strikeouts (SO) in a single stint?
How many players bat and throw with opposite hands? Use COUNT(*) as a summary function to give the total number of rows in the filtered dataset.
Which position (POS; catcher, first baseman, outfielder, etc.) had the highest number of fielding errors (E) in 2019?
Between 1980 and 2009 (inclusive), which team has the highest win-loss differential (total wins minus total losses over that 30 year span). SQL supports arithmetic operators in the SELECT clause to summarize.
# exercise 11
SELECT playerID, SUM(SO) as total_so
FROM pitching
WHERE yearID BETWEEN 2010 AND 2019
GROUP BY playerID
ORDER BY total_so DESC
LIMIT 1;
# exercise 12
SELECT COUNT(*) AS ambidextrous
FROM people
WHERE (bats = 'L' AND throws = 'R') OR (bats = 'R' AND throws = 'L');
# exercise 13
SELECT POS AS position, SUM(E) AS total_errors
FROM fielding
WHERE yearID = 2019
GROUP BY POS
ORDER BY total_errors DESC;
# exercise 14
SELECT teamID, (SUM(W) - SUM(L)) AS differential
FROM teams
WHERE yearID BETWEEN 1980 AND 2009
GROUP BY teamID
ORDER BY differential DESC;SQL order of evaluation
FROM peopleWHERE people.playerID = 'scherma01'JOIN pitching ON people.playerID = pitching.playerIDSELECT DISTINCT teamIDNote: optimized order of evaluation
The orders of evaluation that we’ve been looking at throughout this lecture are conceptual. In reality, the SQL software has the freedom to change the order of clauses (lines) in ways that optimize the retrieval of data. In this example, it is likely most efficient to filter people first so that the JOIN has fewer rows to operate on, but there is no guarantee that WHERE will always be evaluated before JOIN in all queries that you will write.
SQL
Note: other types of joins
MySQL supports all the same mutating joins that {dplyr} does, including LEFT JOIN and RIGHT JOIN.
DSC 210 Data Wrangling