Following this lecture, you should be able to:
Distinguish between atomic vectors and lists
Identify the four fundamental atomic vector types, and the three “complex” atomic vector types
Work with (create, mutate, summarize) each of the seven atomic vector types
Describe the relationship between atomic vectors, lists, data frames, and tibbles
Create tibbles
The {stringr} package helps us work with character vectors
The {lubridate} package helps us work with date and date-time vectors
The {tibble} package allows us to work with tibbles
Before discussing tibbles, we need to learn some R basics
What is an atomic vector?
What is a list?
What is a data frame?
Big idea: What are tibbles?
A data frame is a list of atomic vectors, and tibbles are just “better” data frames.
R supports two fundamental types of vector
Atomic vectors (simply called “vectors”)
Lists
Both are data structures that hold many elements
Logical (lgl): c(TRUE, FALSE)
Integer (int): c(1L, 2L, 3L)
Double (dbl): c(1.2, 3.4)
Character (chr): c("D", "S", "C")
ints and dbls are both numeric vectors
Tip: Checking vector types
We can use the typeof() function on a vector to return the type.
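As a quick illustration, the sketch below calls typeof() on one vector of each fundamental type. The variable names are placeholders chosen for this example, not names from the lecture.

```r
# One vector of each fundamental atomic type (names are illustrative)
lgl_vec <- c(TRUE, FALSE)
int_vec <- c(1L, 2L, 3L)
dbl_vec <- c(1.2, 3.4)
chr_vec <- c("D", "S", "C")

typeof(lgl_vec)  # "logical"
typeof(int_vec)  # "integer"
typeof(dbl_vec)  # "double"
typeof(chr_vec)  # "character"
```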
Big idea: Atomic vectors are homogeneous
All elements of a given atomic vector must be of the same type.
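One consequence of homogeneity is that c() converts mixed inputs to a single common type rather than raising an error. The sketch below uses made-up values to show which type wins in a few common mixes.

```r
# Mixing types inside c() forces everything to one common type
mixed_num <- c(1L, 2.5)        # integer + double
mixed_lgl <- c(TRUE, 2L)       # logical + integer
mixed_chr <- c(1, "a", TRUE)   # number + string + logical

typeof(mixed_num)  # "double"
typeof(mixed_lgl)  # "integer"
typeof(mixed_chr)  # "character"
mixed_chr          # "1" "a" "TRUE"
```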
Functions like as.numeric(), as.logical(), etc. convert a vector from one type to another
Certain conversions are not possible (e.g., words to numbers), or give undesirable results (e.g., decimals to integers)
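A minimal sketch of both situations, using made-up inputs: conversions that work, a conversion that is not possible (it yields NA with a warning, suppressed here), and a conversion that works but drops information.

```r
# Conversions that behave as expected
nums  <- as.numeric(c("1", "2.5"))            # 1.0 2.5
flags <- as.logical(c("TRUE", "false", "T"))  # TRUE FALSE TRUE

# Not possible: a word cannot become a number, so the result is NA
# (R also emits a warning, which is suppressed for this example)
word_num <- suppressWarnings(as.numeric("five"))  # NA

# Undesirable: as.integer() drops the decimal part instead of rounding
truncated <- as.integer(c(1.9, -1.9))  # 1 -1
```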
We can use mutate() to create indicator variables
penguins |>
  drop_na(species, sex, body_mass_g) |>
  group_by(species, sex) |>
  mutate(overweight = body_mass_g > (mean(body_mass_g) + (sd(body_mass_g)))) |>
  slice_sample(n = 2) |> # two random obs from each group
  select(species, sex, body_mass_g, overweight)
# A tibble: 12 × 4
# Groups:   species, sex [6]
   species   sex    body_mass_g overweight
   <fct>     <fct>        <int> <lgl>
 1 Adelie    female        3800 TRUE
 2 Adelie    female        2925 FALSE
 3 Adelie    male          4700 TRUE
 4 Adelie    male          4300 FALSE
 5 Chinstrap female        3700 FALSE
 6 Chinstrap female        3300 FALSE
 7 Chinstrap male          4050 FALSE
 8 Chinstrap male          3950 FALSE
 9 Gentoo    female        4300 FALSE
10 Gentoo    female        3950 FALSE
11 Gentoo    male          5700 FALSE
12 Gentoo    male          5000 FALSE
any() and all() can be used to summarize a logical vector into a logical scalar
We can use them within summarize() to summarize indicator variables
penguins |>
  drop_na(species, sex, body_mass_g) |>
  group_by(species, sex) |>
  mutate(obese = body_mass_g > (mean(body_mass_g) + (2 * sd(body_mass_g)))) |>
  summarize(obese = any(obese))
# A tibble: 6 × 3
# Groups:   species [3]
  species   sex    obese
  <fct>     <fct>  <lgl>
1 Adelie    female FALSE
2 Adelie    male   TRUE
3 Chinstrap female TRUE
4 Chinstrap male   TRUE
5 Gentoo    female FALSE
6 Gentoo    male   TRUE
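Outside of a pipeline, any() and all() can be tried on a small logical vector. The values below are invented for illustration; the last two lines show how a missing value affects the result.

```r
# A small, made-up indicator vector
overweight <- c(FALSE, TRUE, FALSE)

any_true <- any(overweight)  # TRUE: at least one element is TRUE
all_true <- all(overweight)  # FALSE: not every element is TRUE

# With a missing value the answer can be unknown, unless NAs are removed
unknown <- any(c(FALSE, NA))                # NA
known   <- any(c(FALSE, NA), na.rm = TRUE)  # FALSE
```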
Exercise 1: Create a new variable in nycflights13::flights called “red_eye” that is TRUE when the flight departs between 22:00 and 23:59, and arrives no later than 05:00.
Exercise 2: Do any() of the NYC airports have red eye flights on the 4th of July? Make sure to drop_na() from the dep_time and arr_time columns first.
# exercise 1
flights |>
  mutate(red_eye = between(dep_time, 2200, 2359) & arr_time <= 500,
         .keep = "used")
# exercise 2
flights |>
  drop_na(dep_time, arr_time) |>
  mutate(red_eye = between(dep_time, 2200, 2359) & arr_time <= 500) |>
  filter(month == 7, day == 4) |>
  group_by(origin) |>
  summarize(any_red_eyes = any(red_eye))
Note: We’ve been here before
We’ve already discussed several ways in which we can mutate() and summarize() numbers. Here are even more cool things we can do!
We can round numbers with the round() function
cumsum() and cummean() compute cumulative sums and means, and work well within mutate()
With the quantile() function we can find values at or below which a given percentage of the data falls
The quantile() function works well within summarize() to mark important positions in the distribution of a given variable
penguins |>
  drop_na(species, sex) |>
  group_by(species, sex) |>
  summarize(flipper_0.1 = quantile(flipper_length_mm,
                                   probs = 0.1,
                                   na.rm = TRUE))
# A tibble: 6 × 3
# Groups:   species [3]
  species   sex    flipper_0.1
  <fct>     <fct>        <dbl>
1 Adelie    female        181
2 Adelie    male          184
3 Chinstrap female        186.
4 Chinstrap male          193
5 Gentoo    female        208
6 Gentoo    male          215
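The same functions can be tried on plain vectors. This is a small sketch with invented numbers; the running mean is computed in base R here as a stand-in for cummean(), which comes from {dplyr}.

```r
# round(): digits can be positive, zero, or negative
x <- c(12.345, 678.9, 1234.5)
rounded_tenths   <- round(x, digits = 1)   # 12.3 678.9 1234.5
rounded_hundreds <- round(x, digits = -2)  # 0 700 1200

# cumsum(): running total of a made-up delay vector
delays <- c(5, -2, 10, 0)
running_total <- cumsum(delays)                      # 5 3 13 13
running_mean  <- running_total / seq_along(delays)   # base R running mean

# quantile(): values at or below which 50% and 99% of the data fall
q <- quantile(0:100, probs = c(0.5, 0.99))
```

The result of quantile() is a named vector (names such as "50%"), which summarize() accepts as long as a single probability is requested per group.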
Exercise 3: For the flights on December 25, compute the cumulative departure delay over the course of the day, ordered by departure time. Make sure to drop_na() from the departure delay column first.
Exercise 4: Round the cumulative departure delay from Exercise 3 to the nearest hundred (hint: the digits= argument can be negative).
Exercise 5: What is the 99th percentile of the distance variable? That is, what is the value of distance below which 99% of the flights fall?
# exercise 3
flights |>
  filter(month == 12, day == 25) |>
  drop_na(dep_delay) |>
  arrange(dep_time) |>
  mutate(cumul_dep_delay = cumsum(dep_delay))
# exercise 4
flights |>
  filter(month == 12, day == 25) |>
  drop_na(dep_delay) |>
  arrange(dep_time) |>
  mutate(cumul_dep_delay = cumsum(dep_delay),
         cumul_dep_delay = round(cumul_dep_delay, digits = -2))
# exercise 5
flights |>
  summarize(distance_0.99 = quantile(distance, probs = 0.99))
Tip: The {stringr} package
The {stringr} package (included in {tidyverse}) makes working with character vectors easy!
The fruit character vector is a built-in data object from {stringr}
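The specific {stringr} functions covered on the slides are not preserved in this text, so the sketch below is only an assumption about what a first look might include. It uses a short, made-up character vector (`snacks`) rather than `fruit` so that the results are easy to check by eye.

```r
library(stringr)

# A small, hypothetical character vector
snacks <- c("apple", "banana", "cherry")

n_chars  <- str_length(snacks)        # 5 6 6
shouted  <- str_to_upper(snacks)      # "APPLE" "BANANA" "CHERRY"
has_an   <- str_detect(snacks, "an")  # FALSE TRUE FALSE
first3   <- str_sub(snacks, 1, 3)     # "app" "ban" "che"
```

Each function takes the character vector as its first argument, so these calls drop straight into mutate() or filter().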
Tip: The {stringr} cheatsheet
We’ve only scratched the surface of what the {stringr} package can help us to do! If you want to read more, check out the {stringr} cheatsheet.
In R, categorical variables are stored as factors
Calendar dates are stored as date objects
Specific moments in time (date-times) are stored as POSIXct objects
Each is built “on top of” a fundamental atomic vector type (factors on integers; dates and date-times on doubles)
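One way to see this layering is to compare class(), which reports the "complex" type, with typeof(), which reports the atomic type underneath. The objects below are made up for illustration.

```r
# Hypothetical examples of the three "complex" atomic vector types
pet    <- factor(c("dog", "cat", "dog"))
day    <- as.Date("2024-10-17")
moment <- as.POSIXct("2024-10-17 06:00:00", tz = "UTC")

class(pet)      # "factor"
typeof(pet)     # "integer"

class(day)      # "Date"
typeof(day)     # "double"

class(moment)   # "POSIXct" "POSIXt"
typeof(moment)  # "double"
```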
We create factors using the factor() function
By default, factor levels are listed in ascending order (either alphabetically or numerically)
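A short sketch of the default ordering, together with one way to override it through the levels= argument of factor(). The `sizes` vector is invented for this example.

```r
# A made-up categorical variable
sizes <- c("medium", "small", "large", "small")

# Default: levels are sorted alphabetically
default_fct <- factor(sizes)
levels(default_fct)   # "large" "medium" "small"

# Custom: levels appear in the order we supply
ordered_fct <- factor(sizes, levels = c("small", "medium", "large"))
levels(ordered_fct)   # "small" "medium" "large"

# Underneath, each value is stored as the integer position of its level
codes <- as.integer(ordered_fct)  # 2 1 3 1
```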
We can set a different order with the levels= argument
Tip: The {lubridate} package
The {lubridate} package (included in {tidyverse}) brings order to the chaos (e.g., timezones, leap days, daylight saving time, etc.) that swirls around humankind’s time-keeping systems.
storms |>
  mutate(date = make_date(year = year, month = month, day = day),
         date_time = make_datetime(year = year, month = month,
                                   day = day, hour = hour),
         .keep = "used")
# A tibble: 19,537 × 6
    year month   day  hour date       date_time
   <dbl> <dbl> <int> <dbl> <date>     <dttm>
 1  1975     6    27     0 1975-06-27 1975-06-27 00:00:00
 2  1975     6    27     6 1975-06-27 1975-06-27 06:00:00
 3  1975     6    27    12 1975-06-27 1975-06-27 12:00:00
 4  1975     6    27    18 1975-06-27 1975-06-27 18:00:00
 5  1975     6    28     0 1975-06-28 1975-06-28 00:00:00
 6  1975     6    28     6 1975-06-28 1975-06-28 06:00:00
 7  1975     6    28    12 1975-06-28 1975-06-28 12:00:00
 8  1975     6    28    18 1975-06-28 1975-06-28 18:00:00
 9  1975     6    29     0 1975-06-29 1975-06-29 00:00:00
10  1975     6    29     6 1975-06-29 1975-06-29 06:00:00
# ℹ 19,527 more rows
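The same two constructors also work on single values, which makes them easy to experiment with. The sketch below rebuilds the date and date-time from the second row of the printout; the object names are placeholders.

```r
library(lubridate)

# Build a date and a date-time from their numeric components
obs_date <- make_date(year = 1975, month = 6, day = 27)
obs_time <- make_datetime(year = 1975, month = 6, day = 27, hour = 6)

class(obs_date)  # "Date"
class(obs_time)  # "POSIXct" "POSIXt"

# make_datetime() uses the UTC time zone unless tz= says otherwise
format(obs_time, "%Y-%m-%d %H:%M:%S")  # "1975-06-27 06:00:00"
```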
Helpers like mdy() parse dates stored as strings, and there are many variants (e.g., ymd()), including those that can parse strings with both date and time information
# A tibble: 80,332 × 4
   datetime         date_time           state shape
   <chr>            <dttm>              <chr> <chr>
 1 10/10/1949 20:30 1949-10-10 20:30:00 tx    cylinder
 2 10/10/1949 21:00 1949-10-10 21:00:00 tx    light
 3 10/10/1955 17:00 1955-10-10 17:00:00 <NA>  circle
 4 10/10/1956 21:00 1956-10-10 21:00:00 tx    circle
 5 10/10/1960 20:00 1960-10-10 20:00:00 hi    light
 6 10/10/1961 19:00 1961-10-10 19:00:00 tn    sphere
 7 10/10/1965 21:00 1965-10-10 21:00:00 <NA>  circle
 8 10/10/1965 23:45 1965-10-10 23:45:00 ct    disk
 9 10/10/1966 20:00 1966-10-10 20:00:00 al    disk
10 10/10/1966 21:00 1966-10-10 21:00:00 fl    disk
# ℹ 80,322 more rows
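The code that produced the printout above is not preserved in this text. As a hedged sketch, a string such as "10/10/1949 20:30" is in month/day/year hour:minute order, so {lubridate}'s mdy_hm() is a natural fit; that choice is an assumption, not necessarily what the lecture used.

```r
library(lubridate)

# Date-only strings: the function name spells out the component order
d1 <- mdy("10/17/24")     # month, day, year
d2 <- ymd("2024-10-17")   # year, month, day

# Date plus time: append the time components to the function name
sighting <- mdy_hm("10/10/1949 20:30")

d1 == d2                               # TRUE
format(sighting, "%Y-%m-%d %H:%M:%S")  # "1949-10-10 20:30:00"
```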
Unlike atomic vectors, lists do not need to be homogeneous
Unlike atomic vectors, the elements of a list aren’t values (scalars), but entire objects
We index lists a bit differently than we do vectors
Let’s create a list that contains a heterogeneous collection of elements
This list holds three objects: (1) a double vector, (2) a character vector with one value, and (3) a logical vector
I can access the third object using double-bracket indexing
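The list described above can be sketched as follows. The actual values from the slide are not preserved, so the contents and the name `my_list` are placeholders that only match the description (a double vector, a one-element character vector, and a logical vector).

```r
# A heterogeneous list with three elements (values are illustrative)
my_list <- list(c(1.5, 2.5, 3.5), "hello", c(TRUE, FALSE))

# Double brackets pull out the object itself
third <- my_list[[3]]       # TRUE FALSE
typeof(third)               # "logical"

# Single brackets return a smaller list that still wraps the object
third_wrapped <- my_list[3]
typeof(third_wrapped)       # "list"
length(third_wrapped)       # 1
```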
In R, datasets are stored as data frame objects
Let’s look at one of R’s built-in data frame objects, mtcars
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
Each column is an atomic vector object within the list
Column names are the element names of the list
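These claims can be checked directly on mtcars with base R. The sketch below treats the data frame as a list and inspects one of its columns.

```r
# A data frame is a list under the hood
df_type  <- typeof(mtcars)   # "list"
is.list(mtcars)              # TRUE
n_cols   <- length(mtcars)   # 11: one list element per column

# The list's element names are the column names
first_names <- names(mtcars)[1:3]  # "mpg" "cyl" "disp"

# Each element is an atomic vector, reachable with list-style indexing
mpg <- mtcars[["mpg"]]
is.atomic(mpg)  # TRUE
typeof(mpg)     # "double"
```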
Tibbles improve on data frames in many ways, but the most obvious benefit to us is in printing
For a more thorough exploration of the differences between data frames and tibbles, call vignette("tibble")
We can use as_tibble() to convert a base R data frame to a tibble
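A minimal sketch of the conversion, assuming the built-in iris data frame as input (an assumption based on the 150 × 5 shape and column names commonly shown for this example). The name `iris_tbl` is a placeholder.

```r
library(tibble)

# Convert a base R data frame to a tibble
iris_tbl <- as_tibble(iris)

class(iris)      # "data.frame"
class(iris_tbl)  # "tbl_df" "tbl" "data.frame"

# The data are unchanged; only the class (and so the printing) differs
dim(iris_tbl)    # 150 5
```

Because a tibble still carries the "data.frame" class, functions written for data frames generally accept it as well.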
# A tibble: 150 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>
 1          5.1         3.5          1.4         0.2 setosa
 2          4.9         3            1.4         0.2 setosa
 3          4.7         3.2          1.3         0.2 setosa
 4          4.6         3.1          1.5         0.2 setosa
 5          5           3.6          1.4         0.2 setosa
 6          5.4         3.9          1.7         0.4 setosa
 7          4.6         3.4          1.4         0.3 setosa
 8          5           3.4          1.5         0.2 setosa
 9          4.4         2.9          1.4         0.2 setosa
10          4.9         3.1          1.5         0.1 setosa
# ℹ 140 more rows
We can use tibble() to create a tibble from individual vectors
tibble(int_var = c(1L, 2L, 3L),
       chr_var = c("red", "blue", "green"),
       date_var = mdy("10/17/24", "11/5/24", "11/27/24"),
       fct_var = factor(c("dog", "cat", "dog")),
       dbl_var = c(3.14, 2.71, 6.02))
# A tibble: 3 × 5
  int_var chr_var date_var   fct_var dbl_var
    <int> <chr>   <date>     <fct>     <dbl>
1       1 red     2024-10-17 dog        3.14
2       2 blue    2024-11-05 cat        2.71
3       3 green   2024-11-27 dog        6.02
* Note: R supports more than just atomic vector and list objects
DSC 210 Data Wrangling