Following this lecture, you should be able to:
Define the following terms: primary key, compound primary key, foreign key, candidate key, surrogate key
Identify an appropriate primary key for a given table
Sketch relational schemas
Thus far in lecture we have focused on wrangling one data frame at a time
“In the wild” data is often stored in relational databases
Collections of thematic tables (a.k.a data frames)
Observations in one table relate to observations in other tables
Relational databases are easier to work with and maintain when compared to a single-data-frame approach
The data frame below is an example of the of records that the registrar’s office at Hogwarts might maintain.
# A tibble: 12 × 8
first_stud last_stud year house class first_prof last_prof location
<chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
1 Harry Potter 2 Gryffind… Char… Filius Flitwick Classro…
2 Harry Potter 2 Gryffind… Tran… Minerva McGonaga… Classro…
3 Hermione Granger 2 Gryffind… Poti… Severus Snape Dungeons
4 Hermione Granger 2 Gryffind… Char… Filius Flitwick Classro…
5 Hermione Granger 2 Gryffind… Tran… Minerva McGonaga… Classro…
6 Hermione Granger 2 Gryffind… Hist… Cuthbert Binns Classro…
7 Cedric Diggory 4 Hufflepu… Char… Filius Flitwick Classro…
8 Draco Malfoy 2 Slytherin Poti… Severus Snape Dungeons
9 Draco Malfoy 2 Slytherin Char… Filius Flitwick Classro…
10 Cho Chang 3 Ravenclaw Poti… Severus Snape Dungeons
11 Cho Chang 3 Ravenclaw Char… Filius Flitwick Classro…
12 Cho Chang 3 Ravenclaw Hist… Cuthbert Binns Classro…
A few things to note:
This data frame contains many repeated values
Updating information can be tedious (at least conceptually)
In the absence of documentation, the attribution of certain variables is ambiguous (e.g., does house correspond to student or professor?)
We can spot themes in this dataset
Let’s reorganize this data (a process called normalization) to create separate tables for these two themes
# A tibble: 5 × 4
first last year house
<chr> <chr> <dbl> <chr>
1 Harry Potter 2 Gryffindor
2 Hermione Granger 2 Gryffindor
3 Cedric Diggory 4 Hufflepuff
4 Draco Malfoy 2 Slytherin
5 Cho Chang 3 Ravenclaw
# A tibble: 4 × 4
class first last location
<chr> <chr> <chr> <chr>
1 Potions Severus Snape Dungeons
2 Charms Filius Flitwick Classroom 2E
3 Transfiguration Minerva McGonagall Classroom 1B
4 History of Magic Cuthbert Binns Classroom 4F
Progress! We’ve reduced repeated values, made it so single updates need only be applied to single values, and mitigated variable attribution ambiguity.
But we’ve lost something as well — the relations in the data! Is Harry Potter taking charms?! How many classes is Cho Chang taking?!
We need a third table that relates each student to the classes they’re taking
# A tibble: 12 × 3
first last class
<chr> <chr> <chr>
1 Harry Potter Charms
2 Harry Potter Transfiguration
3 Hermione Granger Potions
4 Hermione Granger Charms
5 Hermione Granger Transfiguration
6 Hermione Granger History of Magic
7 Cedric Diggory Charms
8 Draco Malfoy Potions
9 Draco Malfoy Charms
10 Cho Chang Potions
11 Cho Chang Charms
12 Cho Chang History of Magic
In adding this third table, we’ve reintroduced a bit of repetition, and made it so updates to class names, for example, need to made in two tables (both enrollment and classes)
These concessions are required to ensure the relational nature of the database
Thinking relationally to ask questions across datasets
How many students does Professor McGonagall have?
classes and enrollmentAre there any classes without a Gryffindor student?
students and enrollmentDoes Professor Snape teach any Slytherin students?
students, enrollment and classesLet’s get a bit of vocabulary out of the way
When looking at a single table in a relational database:
A primary key is one or more columns that uniquely identifies each observation
A foreign key is one or more columns (in the focal table) that uniquely identifies each observation in a related table.
Primary-foreign key pairs are the “connections” that relate observations in separate tables
We can visualize these connections using diagrams called schemas
The schema for the Hogwarts database is shown below

{nasaweather} databaseThe {nasaweather} package contains three tables of interest: storms, atmos, and elev
Each observation in storms is a particular named storm (e.g., hurricane, tropical storm, etc.) at a particular point in time (measured in 6 hour increments) from 1995 to 2000
Each observation in atmos is an atmospheric weather reading taken in a particular month at a particular location from 1995 to 2000
Each observation in elev is the elevation (in meters) above sea level for a particular location
The map below shows the spatial extent of the data
world <- map_data("world")
world |>
ggplot() +
geom_map(map = world, aes(long, lat, map_id = region),
fill = NA, color = "black") +
scale_x_continuous(limits = range(atmos$long)) +
scale_y_continuous(limits = range(atmos$lat)) +
labs(x = "Longitude", y = "Latitude") +
coord_equal() +
theme(text = element_text(size = 20))
A primary key uniquely identifies each observation
No primary key value should match to more than one row
Let’s find the primary key in nasaweather::storms
We’ll start by looking at the data and making a guess
Each observation is a particular storm at a particular six-hour interval
storms |>
count(name, hour) |> # shorthand for group_by(name, hour) |>
# summarize(n = n())
filter(n > 1) # return all instances of key being associated with more than one observation# A tibble: 314 × 3
name hour n
<chr> <int> <int>
1 Alberto 0 22
2 Alberto 6 22
3 Alberto 12 21
4 Alberto 18 22
5 Alex 0 6
6 Alex 6 6
7 Alex 12 7
8 Alex 18 7
9 Allison 0 9
10 Allison 6 8
# ℹ 304 more rows
Well name and hour certainly don’t make up the primary key
Most storms span multiple days, so we’ll want to add day to our next primary key attempt
# A tibble: 1 × 4
name day hour n
<chr> <int> <int> <int>
1 Lenny 21 0 2
Much better, but still one storm that appears on the same day at the same hour
Let’s take a closer look at this storm
# A tibble: 4 × 11
name year month day hour lat long pressure wind type seasday
<chr> <int> <int> <int> <int> <dbl> <dbl> <int> <int> <chr> <int>
1 Lenny 1999 11 21 0 16 -59 998 40 Tropica… 174
2 Lenny 1999 11 21 0 16.5 -58.1 998 30 Tropica… 174
3 Lenny 1999 11 21 12 17.2 -57.1 999 30 Tropica… 174
4 Lenny 1999 11 21 18 18 -56.7 1000 25 Tropica… 174
Huh, it almost looks like there’s been a data entry error here — one of those two zeros should be a six
The zero observation that is farther northeast (more positive lat and long) with weaker winds is the likely culprit
Lacking the ability to confirm the validity of this edit (changing 0 to 6) with the data maintainers, we’ll opt to simply remove the observation altogether
With this row removed, name, day, and hour qualify as a compound [primary] key for this dataset
Well hold on, certainly there must be other combinations of variables that uniquely identify each observation
Turns out day, hour, lat, pressure and wind also make up a compound key that uniquely identifies each observation
# A tibble: 0 × 6
# ℹ 6 variables: day <int>, hour <int>, lat <dbl>, pressure <int>,
# wind <int>, n <int>
This new key what we might call a candidate key — it functions like a primary key, but has two major flaws
It is unintuitive on a human level — we don’t describe storms by their latitude and pressure
It’s ability to uniquely identify observations is fleeting — we can easily imagine adding more recent observations to this dataset that have the same values of day, hour, lat, pressure and wind
I’d also argue that the previous name, day, hour key shouldn’t be treated as a primary key either — again, we could add new observations that have the same values for these three variables
To create a primary key that is both intuitive and enduring (will always uniquely identify new observations) we should combine name, year, day and hour
Compound primary keys can be tedious and prone to error
You may elect to create (mutate()) your own primary key called a surrogate key
The surrogate key may be a simple numeric column
# A tibble: 2,747 × 12
id name year month day hour lat long pressure wind type
<int> <chr> <int> <int> <int> <int> <dbl> <dbl> <int> <int> <chr>
1 1 Allison 1995 6 3 0 17.4 -84.3 1005 30 Tropic…
2 2 Allison 1995 6 3 6 18.3 -84.9 1004 30 Tropic…
3 3 Allison 1995 6 3 12 19.3 -85.7 1003 35 Tropic…
4 4 Allison 1995 6 3 18 20.6 -85.8 1001 40 Tropic…
5 5 Allison 1995 6 4 0 22 -86 997 50 Tropic…
6 6 Allison 1995 6 4 6 23.3 -86.3 995 60 Tropic…
7 7 Allison 1995 6 4 12 24.7 -86.2 987 65 Hurric…
8 8 Allison 1995 6 4 18 26.2 -86.2 988 65 Hurric…
9 9 Allison 1995 6 5 0 27.6 -86.1 988 65 Hurric…
10 10 Allison 1995 6 5 6 28.5 -85.6 990 60 Tropic…
# ℹ 2,737 more rows
# ℹ 1 more variable: seasday <int>
# A tibble: 2,747 × 12
id name year month day hour lat long pressure wind type
<chr> <chr> <int> <int> <int> <int> <dbl> <dbl> <int> <int> <chr>
1 Allison1… Alli… 1995 6 3 0 17.4 -84.3 1005 30 Trop…
2 Allison1… Alli… 1995 6 3 6 18.3 -84.9 1004 30 Trop…
3 Allison1… Alli… 1995 6 3 12 19.3 -85.7 1003 35 Trop…
4 Allison1… Alli… 1995 6 3 18 20.6 -85.8 1001 40 Trop…
5 Allison1… Alli… 1995 6 4 0 22 -86 997 50 Trop…
6 Allison1… Alli… 1995 6 4 6 23.3 -86.3 995 60 Trop…
7 Allison1… Alli… 1995 6 4 12 24.7 -86.2 987 65 Hurr…
8 Allison1… Alli… 1995 6 4 18 26.2 -86.2 988 65 Hurr…
9 Allison1… Alli… 1995 6 5 0 27.6 -86.1 988 65 Hurr…
10 Allison1… Alli… 1995 6 5 6 28.5 -85.6 990 60 Trop…
# ℹ 2,737 more rows
# ℹ 1 more variable: seasday <int>
Note: Surrogate keys and relations
Creating a surrogate key in one table necessitates re-creating the same surrogate key (as a foreign key) in other tables if you want to create relations using the surrogate.
{nycflights13} databaseThe relations between flights, airports, weather, planes and airlines are shown in the schema below

Note: Schema column names
Most of these table contain more variables than are shown in the schema. To simplify these diagrams, only variables that make up primary and foreign keys are included — all others are indicated by the ellipsis (…).
airports table have in flights?weather table have in flights?flights have a foreign key in any of the other tables?flights and weatherorigin and desttime_hour and originflightsairlines, flights and weatherSteps to identifying relations between tables
airports and weather can be related through flights, but they also share a direct relation — let’s find it!
The primary key in airports is faa, which looks like this
weather is compound and made up of time_hour and origin, which look like this [1] "2013-08-14 14:00:00 EDT" "2013-06-11 19:00:00 EDT"
[3] "2013-06-03 22:00:00 EDT" "2013-09-24 07:00:00 EDT"
[5] "2013-11-14 03:00:00 EST" "2013-06-24 08:00:00 EDT"
[7] "2013-07-09 16:00:00 EDT" "2013-06-08 00:00:00 EDT"
[9] "2013-03-19 04:00:00 EDT" "2013-02-22 20:00:00 EST"
[1] "JFK" "EWR" "LGA" "LGA" "JFK" "EWR" "LGA" "LGA" "EWR" "EWR"
We can read in the documentation for these two data frames that airports$faa and weather$origin are the same variable — the FAA code for the airport corresponding to each observation
We can conclude that these two table are related via the primary to foreign key pair composed of airports$faa and weather$origin
We can imagine appending a new arrow to the schema that begins at faa and ends at origin
DSC 210 Data Wrangling