Following this lecture, you should be able to:
All of our relational database work thus far has been in small, flat-file databases (FFDBs)
Small datasets stored locally
Application specific (.rds) or common file formats (.csv)
Tables stored on disk and pulled into memory
library(nycflights13)
read_csv()
Computer memory
Most modern personal computers have ~ 8GB of RAM
We run into issues when our FFDB won’t fit into memory
Database integrity
FFDBs have no idea that they’re relational databases
Updates to specific files can inadvertently break relations
Database access
Database security
Designed for building and querying databases in SQL
SQL is currently the industry standard for interfacing with relational databases
Developed about 50 years ago
Built on relational algebra and tuple (relational) calculus
Assumes rectangular data (organized into rows and columns)
There also exist “NoSQL” database management systems
ssh dsc210@10.100.50.88
mysql -u dsc210 -p
USE lahman;
Note: accessing the SQL server
You can only access the SQL server when on Gordon’s network.
lahman
databaseMaintained by Sean Lahman, journalist and baseball enthusiast
Our copy of the database is made up of more than two dozen tables (1871 through 2019)
people
: all professional baseball players
batting
: batting statistics (e.g., at bats, hits, home runs)
pitching
: pitching statistics (e.g., pitch count, strikes, ERA)
fielding
: fielding statistics (e.g., errors, double plays)
Important: observation definitions
In the lahman
database, the definition of each observations changes depending on the table.
people
: each observation is a professional baseball player
batting
, pitching
, fielding
: each observation is a particular “stint” in a particular year for a particular player
SHOW TABLES;
will list all datasets in the lahman
database
DESCRIBE people;
will give you a breakdown of people
, indicating key columns, NULL columns, etc.
You’ll often have a schema to help you, but not always (may God have mercy on your soul)
SELECT…FROM
queriesEvery SQL query requires the two keywords (functions) SELECT
and FROM
SQL order of evaluation
FROM people
SELECT nameFirst, nameLast
Tip: selecting all columns
You can select all columns from a given table using *
. For example, SELECT * FROM people
will return the entire people
table.
WHERE
Note: ending SQL queries
Use ;
to indicate the end of a SQL query.
SQL order of evaluation
FROM people
WHERE birthYear > 1993
SELECT nameFirst, nameLast
Tip: using Boolean and logical operators
The WHERE
keyword supports the following operators:
equal to =
not equal to <>
or !=
greater than >
; greater than or equal to >=
less than <
; less than or equal to <=
AND
; OR
Using the server connection you made earlier, write SQL queries to complete all exercises. Don’t forget to end your query with ;
.
Tip: string literals in MySQL
In MySQL you can create string literals using either single or double quotes, but single quotes are more commonly used.
height
is recorded in inches)W
column records wins in a single stint)HR
) and at least 100 RBI
s in a single stint'BOS'
) fielders not making an error (E
) in a stintUse SELECT DISTINCT state FROM parks;
to see how state data is stored
DISTINCT
works just like dplyr::distinct()
, giving you only the unique values
# exercise 1
SELECT nameFirst, nameLast, height
FROM people
WHERE height < 72;
# exercise 2
SELECT playerID, W
FROM pitching
WHERE W >= 20;
# exercise 3
SELECT playerID, HR, RBI
FROM batting
WHERE HR > 45 AND RBI >= 100;
# exercise 4
SELECT playerID, teamID, E
FROM fielding
WHERE teamID = 'BOS' AND E = 0;
# exercise 5
SELECT parkname, state
FROM parks
WHERE state = 'CA' OR state = 'FL' OR state = 'TX';
BETWEEN
SQL
OR
) with IN
We can filter for observations containing particular string patterns using SQL’s %
and _
wilcards
%
represents 0 or more characters
_
represents a single character
Wildcards are used in conjunction with the LIKE
keyword
Let’s find all players with a last name that starts with “A”
Tip: limiting results
When your query returns many rows, you may want to use the LIMIT
keyword to look at just the first several.
Using the server connection you made earlier, write SQL queries to complete all exercises. Don’t forget to end your query with ;
.
BETWEEN
) who had more than 20 lossesSELECT DISTINCT
) player last names that end in “son”# exercise 6
SELECT parkname, state
FROM parks
WHERE state IN ('WA', 'OR', 'CA');
# exercise 7
SELECT playerID, year, L
FROM pitching
WHERE yearID BETWEEN 1900 AND 2000 AND L > 20;
# exercise 8
SELECT DISTINCT nameLast
FROM people
WHERE nameLast LIKE '%son';
# exercise 9
SELECT parkname
FROM parks
WHERE parkname LIKE '%Ball%' OR parkname LIKE '%ball%';
# exercise 10
SELECT DISTINCT nameFirst
FROM people
WHERE nameFirst LIKE '%e_';
ORDER BY
dplyr::arrange()
, SQL’s ORDER BY
keyword is used to sort rows in either ascending or descending orderSQL order of evaluation
FROM salaries
WHERE salary > 20000000
SELECT salary, yearID
ORDER BY yearID
LIMIT 10
Tip: descending order
The DESC
keyword can be used in conjunction with ORDER BY
to sort rows in descending order. For example, ORDER BY yearID DESC
.
Unlike {dplyr}
, SQL doesn’t have a dedicated SUMMARIZE
keyword
We can summarize by calling functions directly on column names in the SELECT
line
Note: summarize order of operations
The summary functions (AVG()
, SUM()
, MIN()
, COUNT()
, etc.) are executed near the end of the order of evaluation along with the SELECT
keyword.
AS
AS
to create column aliasesGROUP BY
SQL
SQL order of evaluation
FROM batting
WHERE playerID = 'aaronha01'
GROUP BY yearID
SELECT...
Using the server connection you made earlier, write SQL queries to complete all exercises. Don’t forget to end your query with ;
.
In the last decade (ending in 2019), which pitcher holds the record for most strikeouts (SO
) in a single stint?
How many players bat and throw with opposite hands? Use COUNT(*)
as a summary function to give the total number of rows in the filtered dataset.
Which position (POS
; catcher, first baseman, outfielder, etc.) had the highest number of fielding errors (E
) in 2019?
Between 1980 and 2009 (inclusive), which team has the highest win-loss differential (total wins minus total losses over that 30 year span). SQL supports arithmetic operators in the SELECT
clause to summarize.
# exercise 11
SELECT playerID, SUM(SO) as total_so
FROM pitching
WHERE yearID BETWEEN 2010 AND 2019
GROUP BY playerID
ORDER BY total_so DESC
LIMIT 1;
# exercise 12
SELECT COUNT(*) AS ambidextrous
FROM people
WHERE (bats = 'L' AND throws = 'R') OR (bats = 'R' AND throws = 'L');
# exercise 13
SELECT POS AS position, SUM(E) AS total_errors
FROM fielding
WHERE yearID = 2019
GROUP BY POS
ORDER BY total_errors DESC;
# exercise 14
SELECT teamID, (SUM(W) - SUM(L)) AS differential
FROM teams
WHERE yearID BETWEEN 1980 AND 2009
GROUP BY teamID
ORDER BY differential DESC;
SQL order of evaluation
FROM people
WHERE people.playerID = 'scherma01'
JOIN pitching ON people.playerID = pitching.playerID
SELECT DISTINCT teamID
Note: optimized order of evaluation
The orders of evaluation that we’ve been looking at throughout this lecture are conceptual. In reality, the SQL software has the freedom to change the order of clauses (lines) in ways that optimize the retrieval of data. In this example, it is likely most efficient to filter people
first so that the JOIN
has fewer rows to operate on, but there is no guarantee that WHERE
will always be evaluated before JOIN
in all queries that you will write.
SQL
Note: other types of joins
MySQL supports all the same mutating joins that {dplyr}
does, including LEFT JOIN
and RIGHT JOIN
.
DSC 210 Data Wrangling