Relational Data and Schemas

Sam Mason

Learning goals

Following this lecture, you should be able to:

  • Define the following terms: primary key, compound primary key, foreign key, candidate key, surrogate key

  • Identify an appropriate primary key for a given table

  • Sketch relational schemas

Packages

library(tidyverse) # {dplyr} is the relational data workhorse
library(nycflights13) # contains several related tables
library(nasaweather) # contains several related tables
library(Lahman) # contains several related tables

Working across many tables

  • Thus far in lecture we have focused on wrangling one data frame at a time

  • “In the wild” data is often stored in relational databases

    • Collections of thematic tables (a.k.a. data frames)

    • Observations in one table relate to observations in other tables

  • Relational databases are easier to work with and maintain when compared to a single-data-frame approach

Case study: Hogwarts registrar

The data frame below is an example of the kind of records that the registrar’s office at Hogwarts might maintain.

registrar
# A tibble: 12 × 8
   first_stud last_stud  year house     class first_prof last_prof location
   <chr>      <chr>     <dbl> <chr>     <chr> <chr>      <chr>     <chr>   
 1 Harry      Potter        2 Gryffind… Char… Filius     Flitwick  Classro…
 2 Harry      Potter        2 Gryffind… Tran… Minerva    McGonaga… Classro…
 3 Hermione   Granger       2 Gryffind… Poti… Severus    Snape     Dungeons
 4 Hermione   Granger       2 Gryffind… Char… Filius     Flitwick  Classro…
 5 Hermione   Granger       2 Gryffind… Tran… Minerva    McGonaga… Classro…
 6 Hermione   Granger       2 Gryffind… Hist… Cuthbert   Binns     Classro…
 7 Cedric     Diggory       4 Hufflepu… Char… Filius     Flitwick  Classro…
 8 Draco      Malfoy        2 Slytherin Poti… Severus    Snape     Dungeons
 9 Draco      Malfoy        2 Slytherin Char… Filius     Flitwick  Classro…
10 Cho        Chang         3 Ravenclaw Poti… Severus    Snape     Dungeons
11 Cho        Chang         3 Ravenclaw Char… Filius     Flitwick  Classro…
12 Cho        Chang         3 Ravenclaw Hist… Cuthbert   Binns     Classro…
  • A few things to note:

    • This data frame contains many repeated values

    • Updating information can be tedious (at least conceptually)

      • Imagine Charms changing classrooms: you’d be responsible for updating the location value for many observations
    • In the absence of documentation, the attribution of certain variables is ambiguous (e.g., does house correspond to student or professor?)

  • We can spot themes in this dataset

    • We have information specific to students, and information specific to classes
  • Let’s reorganize this data (a process called normalization) to create separate tables for these two themes
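  • One way to do this with {dplyr} is sketched below (a minimal sketch, assuming the registrar data frame above; the names students and classes are just illustrative):

# pull out the student-specific and class-specific columns,
# keeping one row per unique combination
students <- registrar |>
  distinct(first = first_stud, last = last_stud, year, house)

classes <- registrar |>
  distinct(class, first = first_prof, last = last_prof, location)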

# A tibble: 5 × 4
  first    last     year house     
  <chr>    <chr>   <dbl> <chr>     
1 Harry    Potter      2 Gryffindor
2 Hermione Granger     2 Gryffindor
3 Cedric   Diggory     4 Hufflepuff
4 Draco    Malfoy      2 Slytherin 
5 Cho      Chang       3 Ravenclaw 
# A tibble: 4 × 4
  class            first    last       location    
  <chr>            <chr>    <chr>      <chr>       
1 Potions          Severus  Snape      Dungeons    
2 Charms           Filius   Flitwick   Classroom 2E
3 Transfiguration  Minerva  McGonagall Classroom 1B
4 History of Magic Cuthbert Binns      Classroom 4F
  • Progress! We’ve reduced repeated values, made it so single updates need only be applied to single values, and mitigated variable attribution ambiguity.

  • But we’ve lost something as well — the relations in the data! Is Harry Potter taking Charms?! How many classes is Cho Chang taking?!

  • We need a third table that relates each student to the classes they’re taking
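  • A sketch of one way to build it, again assuming the registrar data frame above (the name enrollment is illustrative):

# one row per student-class pairing
enrollment <- registrar |>
  distinct(first = first_stud, last = last_stud, class)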

# A tibble: 12 × 3
   first    last    class           
   <chr>    <chr>   <chr>           
 1 Harry    Potter  Charms          
 2 Harry    Potter  Transfiguration 
 3 Hermione Granger Potions         
 4 Hermione Granger Charms          
 5 Hermione Granger Transfiguration 
 6 Hermione Granger History of Magic
 7 Cedric   Diggory Charms          
 8 Draco    Malfoy  Potions         
 9 Draco    Malfoy  Charms          
10 Cho      Chang   Potions         
11 Cho      Chang   Charms          
12 Cho      Chang   History of Magic
  • In adding this third table, we’ve reintroduced a bit of repetition, and made it so updates to class names, for example, need to be made in two tables (both enrollment and classes)

  • These concessions are required to ensure the relational nature of the database

Asking questions across tables

Thinking relationally to ask questions across datasets

  • How many students does Professor McGonagall have?

    • Need data from both classes and enrollment (one sketch appears after this list)
  • Are there any classes without a Gryffindor student?

    • Need data from both students and enrollment
  • Does Professor Snape teach any Slytherin students?

    • Need data from students, enrollment and classes
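
  • As a preview, here is one sketch of the first question, assuming the classes and enrollment tables built earlier (joins, a cleaner tool for this, come in a later lecture):

# classes taught by Professor McGonagall
mcgonagall_classes <- classes |>
  filter(last == "McGonagall") |>
  pull(class)

# distinct students enrolled in any of those classes
enrollment |>
  filter(class %in% mcgonagall_classes) |>
  distinct(first, last) |>
  nrow()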

Relational keys

  • Let’s get a bit of vocabulary out of the way

  • When looking at a single table in a relational database:

    • A primary key is one or more columns that uniquely identifies each observation

    • A foreign key is one or more columns (in the focal table) that uniquely identifies each observation in a related table.

  • Primary-foreign key pairs are the “connections” that relate observations in separate tables (a quick check of both key types is sketched after this list)

  • We can visualize these connections using diagrams called schemas

  • The schema for the Hogwarts database is shown below

  • Each table gets its own rectangle containing the names of its variables
  • Variables that make up the primary key in a table are colored grey
  • Relations are shown using arrows that start at the primary key and end at the foreign key
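
  • To make the vocabulary concrete, here is a quick check using the Hogwarts tables from earlier (a sketch; anti_join() is covered with the other joins in a later lecture):

# primary key check: no combination of first and last should repeat
students |>
  count(first, last) |>
  filter(n > 1)

# foreign key check: every (first, last) in enrollment should exist in students
enrollment |>
  anti_join(students, by = c("first", "last"))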

The {nasaweather} database

  • The {nasaweather} package contains three tables of interest: storms, atmos, and elev

    • Each observation in storms is a particular named storm (e.g., hurricane, tropical storm, etc.) at a particular point in time (measured in 6 hour increments) from 1995 to 2000

    • Each observation in atmos is an atmospheric weather reading taken in a particular month at a particular location from 1995 to 2000

    • Each observation in elev is the elevation (in meters) above sea level for a particular location

  • The map below shows the spatial extent of the data

world <- map_data("world") # country outlines bundled with {ggplot2}
world |>
  ggplot() +
  geom_map(map = world, aes(long, lat, map_id = region),
           fill = NA, color = "black") +
  # crop the axes to the extent of the atmos observations
  scale_x_continuous(limits = range(atmos$long)) +
  scale_y_continuous(limits = range(atmos$lat)) +
  labs(x = "Longitude", y = "Latitude") +
  coord_equal() +
  theme(text = element_text(size = 20))

Finding primary keys

  • A primary key uniquely identifies each observation

  • No primary key value should match to more than one row

  • Let’s find the primary key in nasaweather::storms

  • We’ll start by looking at the data and making a guess

  • Each observation is a particular storm at a particular six-hour interval

storms |>
  count(name, hour) |> # shorthand for group_by(name, hour) |>
                       # summarize(n = n())
  filter(n > 1) # return all instances of key being associated with more than one observation
# A tibble: 314 × 3
   name     hour     n
   <chr>   <int> <int>
 1 Alberto     0    22
 2 Alberto     6    22
 3 Alberto    12    21
 4 Alberto    18    22
 5 Alex        0     6
 6 Alex        6     6
 7 Alex       12     7
 8 Alex       18     7
 9 Allison     0     9
10 Allison     6     8
# ℹ 304 more rows
  • Well, name and hour certainly don’t make up the primary key

  • Most storms span multiple days, so we’ll want to add day to our next primary key attempt

storms |>
  count(name, day, hour) |>
  filter(n > 1)
# A tibble: 1 × 4
  name    day  hour     n
  <chr> <int> <int> <int>
1 Lenny    21     0     2
  • Much better, but there’s still one storm that appears twice on the same day at the same hour

  • Let’s take a closer look at this storm

storms |>
  filter(name == "Lenny", day == 21)
# A tibble: 4 × 11
  name   year month   day  hour   lat  long pressure  wind type     seasday
  <chr> <int> <int> <int> <int> <dbl> <dbl>    <int> <int> <chr>      <int>
1 Lenny  1999    11    21     0  16   -59        998    40 Tropica…     174
2 Lenny  1999    11    21     0  16.5 -58.1      998    30 Tropica…     174
3 Lenny  1999    11    21    12  17.2 -57.1      999    30 Tropica…     174
4 Lenny  1999    11    21    18  18   -56.7     1000    25 Tropica…     174
  • Huh, it almost looks like there’s been a data entry error here — one of those two zeros should be a six

  • The zero observation that is farther northeast (more positive lat and long) with weaker winds is the likely culprit

  • Lacking the ability to confirm the validity of this edit (changing 0 to 6) with the data maintainers, we’ll opt to simply remove the observation altogether

  • With this row removed, name, day, and hour qualify as a compound [primary] key for this dataset
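
  • One way that removal and the follow-up check might look (a sketch; the values used to single out the suspect row come from the output above):

# drop the suspect duplicate observation
storms_fixed <- storms |>
  filter(!(name == "Lenny" & day == 21 & hour == 0 & wind == 30))

# re-check the proposed compound key
storms_fixed |>
  count(name, day, hour) |>
  filter(n > 1) # should now return zero rows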

Candidate vs. primary keys

  • Well hold on, certainly there must be other combinations of variables that uniquely identify each observation

  • Turns out day, hour, lat, pressure and wind also make up a compound key that uniquely identifies each observation

storms |>
  count(day, hour, lat, pressure, wind) |>
  filter(n > 1)
# A tibble: 0 × 6
# ℹ 6 variables: day <int>, hour <int>, lat <dbl>, pressure <int>,
#   wind <int>, n <int>
  • This new key is what we might call a candidate key — it functions like a primary key, but has two major flaws

    • It is unintuitive on a human level — we don’t describe storms by their latitude and pressure

    • Its ability to uniquely identify observations is fleeting — we can easily imagine adding more recent observations to this dataset that have the same values of day, hour, lat, pressure and wind

  • I’d also argue that the previous name, day, hour key shouldn’t be treated as a primary key either — again, we could add new observations that have the same values for these three variables

  • To create a primary key that is both intuitive and enduring (one that will always uniquely identify new observations), we should combine name, year, day and hour
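
  • Checking that proposal against the corrected data (a sketch, using storms_fixed from earlier):

storms_fixed |>
  count(name, year, day, hour) |>
  filter(n > 1) # zero rows would support this compound key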

Surrogate keys

  • Compound primary keys can be tedious and prone to error

  • You may elect to create (mutate()) your own primary key called a surrogate key

  • The surrogate key may be a simple numeric column

storms |>
  mutate(id = row_number(),
         .before = name)
# A tibble: 2,747 × 12
      id name     year month   day  hour   lat  long pressure  wind type   
   <int> <chr>   <int> <int> <int> <int> <dbl> <dbl>    <int> <int> <chr>  
 1     1 Allison  1995     6     3     0  17.4 -84.3     1005    30 Tropic…
 2     2 Allison  1995     6     3     6  18.3 -84.9     1004    30 Tropic…
 3     3 Allison  1995     6     3    12  19.3 -85.7     1003    35 Tropic…
 4     4 Allison  1995     6     3    18  20.6 -85.8     1001    40 Tropic…
 5     5 Allison  1995     6     4     0  22   -86        997    50 Tropic…
 6     6 Allison  1995     6     4     6  23.3 -86.3      995    60 Tropic…
 7     7 Allison  1995     6     4    12  24.7 -86.2      987    65 Hurric…
 8     8 Allison  1995     6     4    18  26.2 -86.2      988    65 Hurric…
 9     9 Allison  1995     6     5     0  27.6 -86.1      988    65 Hurric…
10    10 Allison  1995     6     5     6  28.5 -85.6      990    60 Tropic…
# ℹ 2,737 more rows
# ℹ 1 more variable: seasday <int>
  • Or it may be composed of more informative values
storms |>
  mutate(id = str_c(name, year, day, hour),
         .before = name)
# A tibble: 2,747 × 12
   id        name   year month   day  hour   lat  long pressure  wind type 
   <chr>     <chr> <int> <int> <int> <int> <dbl> <dbl>    <int> <int> <chr>
 1 Allison1… Alli…  1995     6     3     0  17.4 -84.3     1005    30 Trop…
 2 Allison1… Alli…  1995     6     3     6  18.3 -84.9     1004    30 Trop…
 3 Allison1… Alli…  1995     6     3    12  19.3 -85.7     1003    35 Trop…
 4 Allison1… Alli…  1995     6     3    18  20.6 -85.8     1001    40 Trop…
 5 Allison1… Alli…  1995     6     4     0  22   -86        997    50 Trop…
 6 Allison1… Alli…  1995     6     4     6  23.3 -86.3      995    60 Trop…
 7 Allison1… Alli…  1995     6     4    12  24.7 -86.2      987    65 Hurr…
 8 Allison1… Alli…  1995     6     4    18  26.2 -86.2      988    65 Hurr…
 9 Allison1… Alli…  1995     6     5     0  27.6 -86.1      988    65 Hurr…
10 Allison1… Alli…  1995     6     5     6  28.5 -85.6      990    60 Trop…
# ℹ 2,737 more rows
# ℹ 1 more variable: seasday <int>

Note: Surrogate keys and relations

Creating a surrogate key in one table necessitates re-creating the same surrogate key (as a foreign key) in other tables if you want to create relations using the surrogate.
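
As a sketch with the Hogwarts tables from earlier (the table and column names here are illustrative, and left_join() is covered in a later lecture):

# give each student a surrogate key
students_id <- students |>
  mutate(student_id = row_number(), .before = first)

# carry the same key into enrollment so the relation still holds
enrollment_id <- enrollment |>
  left_join(students_id |> select(student_id, first, last),
            by = c("first", "last")) |>
  select(student_id, class)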

The {nycflights13} database

The relations between flights, airports, weather, planes and airlines are shown in the schema below

Note: Schema column names

Most of these tables contain more variables than are shown in the schema. To simplify these diagrams, only variables that make up primary and foreign keys are included — all others are indicated by an ellipsis (…).

In-class exercises

  1. Which tables have compound primary keys?
  2. How many foreign keys does the airports table have in flights?
  3. How many foreign keys does the weather table have in flights?
  4. Does flights have a foreign key in any of the other tables?
  5. I’d like to know the full name of the airline that is the most cautious (has the fewest flights that take off in inclement weather). Which tables would I need to answer this question?

Solutions

  1. flights and weather
  2. Two; origin and dest
  3. One; a compound foreign key composed of time_hour and origin
  4. No; none of the other tables have the set of columns that compose the primary key in flights
  5. airlines, flights and weather

Finding relations

Steps to identifying relations between tables

  1. Identify the primary key in both tables
  2. Look for common variables (the same variable may go by different names in different tables)
  3. If all variables that make up the primary key in one table are present in the second table (as a foreign key) then BOOM, you’ve got yourself a relation
  • airports and weather can be related through flights, but they also share a direct relation — let’s find it!

  • The primary key in airports is faa, which looks like this

sample(airports$faa, size = 10)
 [1] "LHD" "OLH" "IZG" "BYS" "PIR" "MRN" "TYE" "GRI" "LAR" "SAD"
  • The primary key in weather is compound and made up of time_hour and origin, which look like this
sample(weather$time_hour, size = 10)
 [1] "2013-08-14 14:00:00 EDT" "2013-06-11 19:00:00 EDT"
 [3] "2013-06-03 22:00:00 EDT" "2013-09-24 07:00:00 EDT"
 [5] "2013-11-14 03:00:00 EST" "2013-06-24 08:00:00 EDT"
 [7] "2013-07-09 16:00:00 EDT" "2013-06-08 00:00:00 EDT"
 [9] "2013-03-19 04:00:00 EDT" "2013-02-22 20:00:00 EST"
sample(weather$origin, size = 10)
 [1] "JFK" "EWR" "LGA" "LGA" "JFK" "EWR" "LGA" "LGA" "EWR" "EWR"
  • From the documentation for these two data frames, we can see that airports$faa and weather$origin record the same variable — the FAA code for the airport corresponding to each observation

  • We can conclude that these two tables are related via the primary-to-foreign key pair composed of airports$faa and weather$origin

  • We can imagine appending a new arrow to the schema that begins at faa and ends at origin
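
  • A quick sketch of how we might verify this relation (every weather$origin value should match an airports$faa value):

weather |>
  anti_join(airports, by = c("origin" = "faa")) # zero rows means every origin matches an faa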

In-class exercises

  1. Identify the primary keys in Batting, People, and Salaries data frames from the Lahman package.
  2. Draw a schema that properly relates these three tables.
  3. Identify the primary keys in Fielding and Pitching
  4. Draw a schema that properly relates Batting, Fielding, and Pitching
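
As a starting point for exercise 1, the same count-and-filter pattern from lecture works on any guess at a key (a sketch; the columns below are just one guess):

Batting |>
  count(playerID, yearID, stint) |>
  filter(n > 1) # zero rows would support this guess as Batting's primary key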