Vectors, Lists, and Tibbles

Sam Mason

Learning goals

Following this lecture, you should be able to:

Distinguish between atomic vectors and lists
Identify the four fundamental atomic vector types, and the three “complex” atomic vector types
Work with (create, mutate, summarize) each of the seven atomic vector types
Describe the relationship between atomic vectors, lists, data frames, and tibbles
Create tibbles

Packages

library(tidyverse) # including {stringr}, {lubridate}, and {tibble}
library(palmerpenguins) # for the penguins dataset
library(nycflights13) # for the flights dataset

The {stringr} package helps us work with character vectors
The {lubridate} package helps us work with date and date-time vectors
The {tibble} package allows us to work with tibbles

First, a bit of base R

Before discussing tibbles, we need to learn some R basics
- What is an atomic vector?
- What is a list?
- What is a data frame?

Big idea: What are tibbles?

A data frame is a list of atomic vectors, and tibbles are just “better” data frames.

Vectors

R supports two fundamental types of vector
- Atomic vectors (simply called “vectors”)
- Lists
Both are data structures that hold many elements

The four fundamental atomics

Logical: c(TRUE, FALSE)
Integer: c(1L, 2L, 3L)
Double: c(1.2, 3.4)
Character: c("D", "S", "C")
Numeric: integers and doubles are collectively called “numeric” vectors

Tip: Checking vector types

We can use the typeof() function on a vector to return the type.
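As a quick illustration, here is a minimal sketch that calls typeof() on one vector of each fundamental type. The object names (lgl_vec, int_vec, and so on) are made up for this example.

```r
# One example vector for each of the four fundamental atomic types
lgl_vec <- c(TRUE, FALSE)
int_vec <- c(1L, 2L, 3L)   # the L suffix marks an integer
dbl_vec <- c(1.2, 3.4)
chr_vec <- c("D", "S", "C")

typeof(lgl_vec) # "logical"
typeof(int_vec) # "integer"
typeof(dbl_vec) # "double"
typeof(chr_vec) # "character"

# Integers and doubles both count as numeric
is.numeric(int_vec) # TRUE
is.numeric(dbl_vec) # TRUE
```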

Big idea: Atomic vectors are homogeneous

All elements of a given atomic vector must be of the same type.
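One way to see homogeneity in action is to try to mix types inside c(). In the sketch below (object names are hypothetical), R does not raise an error; it silently converts every element to the most flexible type present.

```r
# Mixing a double, a string, and a logical: everything becomes character
mixed <- c(1.5, "a", TRUE)
typeof(mixed) # "character"
mixed         # "1.5"  "a"    "TRUE"

# Mixing a logical and an integer: everything becomes integer
lgl_int <- c(TRUE, 2L)
typeof(lgl_int) # "integer"
lgl_int         # 1 2
```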

Converting types

Functions like as.numeric(), as.logical(), etc.
Certain conversions are not possible (e.g., words to numbers), or give undesirable results (e.g., decimals to integers)
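The sketch below tries a few conversions, including one that fails and one that loses information. The input values are invented for illustration.

```r
# Conversions that work as expected
num_ok <- as.numeric("3.14")                    # 3.14
lgl_ok <- as.logical(c("TRUE", "false", "yes")) # TRUE FALSE NA ("yes" is not recognized)

# Not possible: a word cannot become a number, so we get NA (with a warning)
num_bad <- suppressWarnings(as.numeric("three")) # NA

# Undesirable: the decimal part is dropped, not rounded
int_trunc <- as.integer(c(1.9, -1.9))            # 1 -1
```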

Logical vectors are the result of Boolean algebra

We can use Boolean algebra in mutate() to create indicator variables

penguins |>
  drop_na(species, sex, body_mass_g) |>
  group_by(species, sex) |>
  mutate(overweight = body_mass_g > (mean(body_mass_g) + (sd(body_mass_g)))) |>
  slice_sample(n = 2) |> # two random obs from each group
  select(species, sex, body_mass_g, overweight)

# A tibble: 12 × 4
# Groups:   species, sex [6]
   species   sex    body_mass_g overweight
   <fct>     <fct>        <int> <lgl>     
 1 Adelie    female        3800 TRUE      
 2 Adelie    female        2925 FALSE     
 3 Adelie    male          4700 TRUE      
 4 Adelie    male          4300 FALSE     
 5 Chinstrap female        3700 FALSE     
 6 Chinstrap female        3300 FALSE     
 7 Chinstrap male          4050 FALSE     
 8 Chinstrap male          3950 FALSE     
 9 Gentoo    female        4300 FALSE     
10 Gentoo    female        3950 FALSE     
11 Gentoo    male          5700 FALSE     
12 Gentoo    male          5000 FALSE

The functions any() and all() can be used to summarize a logical vector into a logical scalar
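Before using them on a dataset, it may help to see any() and all() on a small hand-made logical vector (the vector names here are hypothetical).

```r
overweight <- c(FALSE, TRUE, FALSE)

any_over <- any(overweight) # TRUE: at least one element is TRUE
all_over <- all(overweight) # FALSE: not every element is TRUE

# With missing values the answer can be NA unless we drop them
with_na <- c(FALSE, NA)
any_na <- any(with_na)                 # NA
any_rm <- any(with_na, na.rm = TRUE)   # FALSE
```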

We can use these functions in summarize() to summarize indicator variables

penguins |>
  drop_na(species, sex, body_mass_g) |>
  group_by(species, sex) |>
  mutate(obese = body_mass_g > (mean(body_mass_g) + (2 * sd(body_mass_g)))) |>
  summarize(obese = any(obese))

# A tibble: 6 × 3
# Groups:   species [3]
  species   sex    obese
  <fct>     <fct>  <lgl>
1 Adelie    female FALSE
2 Adelie    male   TRUE 
3 Chinstrap female TRUE 
4 Chinstrap male   TRUE 
5 Gentoo    female FALSE
6 Gentoo    male   TRUE

In-class exercises

Exercises

1. Create a new indicator variable in nycflights13::flights called “red_eye” that is TRUE when the flight departs between 22:00 and 23:59 and arrives no later than 05:00.
2. Did any() of the NYC airports have red-eye flights on the 4th of July? Make sure to drop_na() on the dep_time and arr_time columns first.

Solutions

# exercise 1
flights |>
  mutate(red_eye = between(dep_time, 2200, 2359) & arr_time <= 500,
         .keep = "used")

# exercise 2
flights |>
  drop_na(dep_time, arr_time) |>
  mutate(red_eye = between(dep_time, 2200, 2359) & arr_time <= 500) |>
  filter(month == 7, day == 4) |>
  group_by(origin) |>
  summarize(any_red_eyes = any(red_eye))

Working with numeric vectors

Note: We’ve been here before

We’ve already discussed several ways in which we can mutate() and summarize() numbers. Here are even more cool things we can do!

Mutating
Summarizing

We can reduce the precision of doubles using the round() function
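A small sketch of round() on made-up values, showing how the digits= argument controls where the rounding happens:

```r
x <- c(3.14159, 2.71828, 1234.567)

r0 <- round(x)              # 3 3 1235 (digits = 0 is the default)
r2 <- round(x, digits = 2)  # 3.14 2.72 1234.57
r_neg <- round(x, digits = -1) # 0 0 1230 (negative digits round to tens, hundreds, ...)
```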

We can calculate cumulative sums and means using cumsum() and cummean()

Cumulative arithmetic can be easily applied within mutate()
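Here is a minimal sketch of both functions, first on a bare vector and then inside mutate(). The delays vector and the column names are invented for this example; cumsum() comes from base R, while cummean() comes from {dplyr}.

```r
library(dplyr)

delays <- c(5, 10, 0, 25)

cumsum(delays)  # 5 15 15 40
cummean(delays) # 5.0 7.5 5.0 10.0

# The same calculations as new columns in a tibble
running <- tibble(delay = delays) |>
  mutate(total_delay = cumsum(delay),
         mean_delay = cummean(delay))
running
```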

Using the quantile() function we can find values at or below which a given percentage of the data falls
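To make this concrete, the sketch below applies quantile() to the integers 1 through 101, a made-up vector chosen so that the answers are easy to check by eye.

```r
x <- 1:101

q_median <- quantile(x, probs = 0.5)         # 51: half of the values are at or below 51
q_tails <- quantile(x, probs = c(0.1, 0.9))  # 11 and 91

# quantile() stops with an error if the data contain NA,
# unless we ask it to drop missing values first
q_na <- quantile(c(x, NA), probs = 0.5, na.rm = TRUE) # 51
```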

The quantile() function works well within summarize() to mark important positions in the distribution of a given variable

penguins |>
  drop_na(species, sex) |>
  group_by(species, sex) |>
  summarize(flipper_0.1 = quantile(flipper_length_mm,
                                   probs = 0.1,
                                   na.rm = TRUE))

# A tibble: 6 × 3
# Groups:   species [3]
  species   sex    flipper_0.1
  <fct>     <fct>        <dbl>
1 Adelie    female        181 
2 Adelie    male          184 
3 Chinstrap female        186.
4 Chinstrap male          193 
5 Gentoo    female        208 
6 Gentoo    male          215

In-class exercises

Exercises

3. Calculate the cumulative sum of departure delay minutes for December 25th. Use drop_na() to remove rows with a missing departure delay, and make sure the rows are in chronological order before summing!
4. Round the cumulative sum to the nearest hundred minutes (the digits= argument can be negative).
5. What is the 0.99 quantile of the distance variable? That is, what is the value of distance at or below which 99% of the flights fall?

Solutions

# exercise 3
flights |>
  filter(month == 12, day == 25) |>
  drop_na(dep_delay) |>
  arrange(dep_time) |>
  mutate(cumul_dep_delay = cumsum(dep_delay))

# exercise 4
flights |>
  filter(month == 12, day == 25) |>
  drop_na(dep_delay) |>
  arrange(dep_time) |>
  mutate(cumul_dep_delay = cumsum(dep_delay),
         cumul_dep_delay = round(cumul_dep_delay, digits = -2))

# exercise 5
flights |>
  summarize(distance_0.99 = quantile(distance, probs = 0.99))

Working with character vectors

Tip: The {stringr} package

The {stringr} package (included in {tidyverse}) makes working with character vectors easy!

The fruit character vector is a built-in data object from {stringr}

fruit[1:10]

 [1] "apple"        "apricot"      "avocado"      "banana"       "bell pepper" 
 [6] "bilberry"     "blackberry"   "blackcurrant" "blood orange" "blueberry"

Detect matches
Subset
Count characters
Replace
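The sketch below pairs each of the four tasks above with one {stringr} function. The choice of function for each task is an assumption on my part (for example, str_subset() for “Subset”), and the some_fruit vector is made up rather than taken from the built-in fruit object.

```r
library(stringr)

some_fruit <- c("apple", "banana", "blueberry", "cherry")

# Detect matches: is the pattern present in each string?
str_detect(some_fruit, "berry")   # FALSE FALSE TRUE FALSE

# Subset: keep only the strings that match
str_subset(some_fruit, "an")      # "banana"

# Count characters
str_length(some_fruit)            # 5 6 9 6

# Replace: only the first match in each string is replaced
str_replace(some_fruit, "a", "A") # "Apple" "bAnana" "blueberry" "cherry"
```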

Tip: The {stringr} cheatsheet

We’ve only scratched the surface of what the {stringr} package can help us to do! If you want to read more, check out the {stringr} cheatsheet.

More complex atomic vectors

In R, categorical variables are stored as factors
Calendar dates are stored as date objects
Specific moments in time (date-times) are stored as POSIXct objects
Each is built “on top of” a fundamental atomic type
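We can peek underneath these objects with typeof() and class(). The sketch below uses base R constructors and made-up values (dates close to 1970-01-01 so the underlying numbers are easy to read).

```r
f <- factor(c("dog", "cat", "dog"))
d <- as.Date("1970-01-10")
dt <- as.POSIXct("1970-01-01 00:01:00", tz = "UTC")

# class() reports the "complex" type we work with...
class(f)  # "factor"
class(d)  # "Date"
class(dt) # "POSIXct" "POSIXt"

# ...while typeof() reveals the fundamental atomic underneath
typeof(f)  # "integer"
typeof(d)  # "double"
typeof(dt) # "double"

# The stored values themselves
as.integer(f)  # 2 1 2 (positions in the levels: cat = 1, dog = 2)
as.numeric(d)  # 9  (days since 1970-01-01)
as.numeric(dt) # 60 (seconds since 1970-01-01 00:00:00 UTC)
```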

Creating factors

We create factors using the factor() function
By default, factor levels are listed in ascending order (either alphabetically or numerically)
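A short sketch of the default behaviour, using invented vectors:

```r
# Character input: levels are sorted alphabetically
sizes <- factor(c("medium", "small", "large", "small"))
levels(sizes) # "large" "medium" "small"

# Numeric input: levels are sorted numerically (then stored as text)
doses <- factor(c(10, 2, 33))
levels(doses) # "2" "10" "33"
```

Notice that the alphabetical ordering of sizes is not a meaningful one, which is why ordinal variables need their levels set by hand.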

Ordinals require proper level ordering through the levels= argument

chr_vec <- c("good", "okay", "okay", "very good", "great")
factor(chr_vec,
       levels = c("okay", "good", "very good", "great"),
       ordered = TRUE)

[1] good      okay      okay      very good great    
Levels: okay < good < very good < great

Creating dates and date-times

Tip: The {lubridate} package

The {lubridate} package (included in {tidyverse}) brings order to the chaos (e.g., timezones, leap days, daylight saving time, etc.) that swirls around humankind’s time-keeping systems.

From components
From strings
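As a minimal sketch of the two approaches, the calls below build the same date once from numeric components and once from a string. The specific dates are made up for illustration.

```r
library(lubridate)

# From components
d1 <- make_date(year = 2024, month = 10, day = 17)
dt1 <- make_datetime(year = 2024, month = 10, day = 17, hour = 9, min = 30)
d1  # "2024-10-17"
dt1 # "2024-10-17 09:30:00 UTC"

# From strings: the function name gives the order of the components
d2 <- mdy("10/17/24")          # month, day, year
d3 <- ymd("2024-10-17")        # year, month, day
dt2 <- mdy_hm("10/17/2024 09:30") # date plus hours and minutes
```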

The make_date() and make_datetime() functions work well when we have time component variables in our dataset

storms |>
  mutate(date = make_date(year = year, month = month, day = day),
         date_time = make_datetime(year = year, month = month,
                                   day = day, hour = hour),
         .keep = "used")

# A tibble: 19,537 × 6
    year month   day  hour date       date_time          
   <dbl> <dbl> <int> <dbl> <date>     <dttm>             
 1  1975     6    27     0 1975-06-27 1975-06-27 00:00:00
 2  1975     6    27     6 1975-06-27 1975-06-27 06:00:00
 3  1975     6    27    12 1975-06-27 1975-06-27 12:00:00
 4  1975     6    27    18 1975-06-27 1975-06-27 18:00:00
 5  1975     6    28     0 1975-06-28 1975-06-28 00:00:00
 6  1975     6    28     6 1975-06-28 1975-06-28 06:00:00
 7  1975     6    28    12 1975-06-28 1975-06-28 12:00:00
 8  1975     6    28    18 1975-06-28 1975-06-28 18:00:00
 9  1975     6    29     0 1975-06-29 1975-06-29 00:00:00
10  1975     6    29     6 1975-06-29 1975-06-29 06:00:00
# ℹ 19,527 more rows

There are many different variations of mdy() (e.g., ymd()), including those that can parse strings with both date and time information (e.g., mdy_hm())

sightings |>
  mutate(date_time = mdy_hm(datetime)) |>
  select(datetime, date_time, state, shape)

# A tibble: 80,332 × 4
   datetime         date_time           state shape   
   <chr>            <dttm>              <chr> <chr>   
 1 10/10/1949 20:30 1949-10-10 20:30:00 tx    cylinder
 2 10/10/1949 21:00 1949-10-10 21:00:00 tx    light   
 3 10/10/1955 17:00 1955-10-10 17:00:00 <NA>  circle  
 4 10/10/1956 21:00 1956-10-10 21:00:00 tx    circle  
 5 10/10/1960 20:00 1960-10-10 20:00:00 hi    light   
 6 10/10/1961 19:00 1961-10-10 19:00:00 tn    sphere  
 7 10/10/1965 21:00 1965-10-10 21:00:00 <NA>  circle  
 8 10/10/1965 23:45 1965-10-10 23:45:00 ct    disk    
 9 10/10/1966 20:00 1966-10-10 20:00:00 al    disk    
10 10/10/1966 21:00 1966-10-10 21:00:00 fl    disk    
# ℹ 80,322 more rows

Lists

Unlike atomic vectors, lists do not need to be homogeneous
Unlike atomic vectors, the elements of a list aren’t values (scalars), but entire objects

We index lists a bit differently than we do vectors
Let’s create a list that contains a heterogeneous collection of elements

my_list <- list(c(1.23, 2.13, 3.12), "Sam", c(TRUE, FALSE, FALSE, TRUE, FALSE))

This list holds three objects: (1) a double vector, (2) a character vector with one value, and (3) a logical vector
I can access the third object using double-bracket indexing
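A minimal sketch of double-bracket indexing, with single-bracket indexing shown alongside for contrast (the list is re-created here so the example runs on its own):

```r
my_list <- list(c(1.23, 2.13, 3.12), "Sam", c(TRUE, FALSE, FALSE, TRUE, FALSE))

# Double brackets pull out the object itself
third <- my_list[[3]]
third         # TRUE FALSE FALSE TRUE FALSE
typeof(third) # "logical"

# Single brackets return a smaller list that still wraps the object
third_wrapped <- my_list[3]
typeof(third_wrapped) # "list"
length(third_wrapped) # 1
```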

If your list has element names, R also allows you to access its objects using the dollar sign operator

my_list <- list(obj1 = c(1.23, 2.13, 3.12),
                obj2 = "Sam",
                obj3 = c(TRUE, FALSE, FALSE, TRUE, FALSE))
my_list$obj2

Data frames are lists of vectors

In R, datasets are stored as data frame objects
Let’s look at one of R’s built-in data frame objects, mtcars

head(mtcars)

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Data frames are themselves lists

typeof(mtcars)

[1] "list"

Each column is an atomic vector object within the list
Column names are the element names of the list

typeof(mtcars$mpg)

[1] "double"
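Because a data frame is a list, the list tools from earlier work on it too. A brief sketch using the built-in mtcars data frame:

```r
is.list(mtcars)    # TRUE
length(mtcars)     # 11: one list element per column
names(mtcars)[1:3] # "mpg" "cyl" "disp"

# Double-bracket and dollar-sign indexing return the same column vector
mpg_a <- mtcars[["mpg"]]
mpg_b <- mtcars$mpg
identical(mpg_a, mpg_b) # TRUE
```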

Tibbles are better data frames

Tibbles improve on data frames in many ways, but the most obvious benefit to us is in printing
For a more thorough exploration of the differences between data frames and tibbles, call vignette("tibble")
We can use as_tibble() to convert a base R data frame to a tibble
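One detail worth sketching: tibbles do not keep row names, so a data frame like mtcars loses its car names on conversion unless we ask as_tibble() to store them in a column. The column name "model" below is my own choice, not something the data prescribes.

```r
library(tibble)

# Row names are dropped by default
cars_tbl <- as_tibble(mtcars)
ncol(cars_tbl) # 11

# The rownames= argument moves them into a regular column instead
cars_named <- as_tibble(mtcars, rownames = "model")
ncol(cars_named)    # 12
cars_named$model[1] # "Mazda RX4"

# A tibble is still a data frame (and therefore still a list)
class(cars_tbl)         # "tbl_df" "tbl" "data.frame"
is.data.frame(cars_tbl) # TRUE
```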

Creating tibbles from…

Data frames
Vectors
Scratch
Files

as_tibble(iris) # iris is a built-in data frame

# A tibble: 150 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# ℹ 140 more rows

tibble(int_var = c(1L, 2L, 3L),
       chr_var = c("red", "blue", "green"),
       date_var = mdy("10/17/24", "11/5/24", "11/27/24"),
       fct_var = factor(c("dog", "cat", "dog")),
       dbl_var = c(3.14, 2.71, 6.02))

# A tibble: 3 × 5
  int_var chr_var date_var   fct_var dbl_var
    <int> <chr>   <date>     <fct>     <dbl>
1       1 red     2024-10-17 dog        3.14
2       2 blue    2024-11-05 cat        2.71
3       3 green   2024-11-27 dog        6.02

tribble(
  ~int_var, ~chr_var, ~lgl_var,
  1L, "red", TRUE, # the L suffix stores integers; a bare 1 would be a double
  2L, "blue", FALSE,
  3L, "green", TRUE
)

# A tibble: 3 × 3
  int_var chr_var lgl_var
    <int> <chr>   <lgl>  
1       1 red     TRUE   
2       2 blue    FALSE  
3       3 green   TRUE

sightings <- read_csv("08_data/sightings.csv")

Note: Data import

We’ll talk more about importing data from files in a future lecture.

Visual summary

* Note: R supports more than just atomic vector and list objects