Vectors, Lists, and Tibbles

Sam Mason

Learning goals

Following this lecture, you should be able to:

  • Distinguish between atomic vectors and lists

  • Identify the four fundamental atomic vector types, and the three “complex” atomic vector types

  • Work with (create, mutate, summarize) each of the seven atomic vector types

  • Describe the relationship between atomic vectors, lists, data frames, and tibbles

  • Create tibbles

Packages

library(tidyverse) # including {stringr}, {lubridate}, and {tibble}
library(palmerpenguins) # for the penguins dataset
library(nycflights13) # for the flights dataset
  • The {stringr} package helps us work with character vectors

  • The {lubridate} package helps us work with date and date-time vectors

  • The {tibble} package allows us to work with tibbles

First, a bit of base R

  • Before discussing tibbles, we need to learn some R basics

    • What is an atomic vector?

    • What is a list?

    • What is a data frame?

Big idea: What are tibbles?

A data frame is a list of atomic vectors, and tibbles are just “better” data frames.

Vectors

  • R supports two fundamental types of vector

    • Atomic vectors (simply called “vectors”)

    • Lists

  • Both are data structures that hold many elements

The four fundamental atomics

  • Logical: c(TRUE, FALSE)
  • Integer: c(1L, 2L, 3L)
  • Double: c(1.2, 3.4)
  • Character: c("D", "S", "C")
  • Numeric: not a fifth type, but a collective term for integers and doubles

Tip: Checking vector types

We can use the typeof() function on a vector to return the type.
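
As a quick sketch of this tip, we can build one small vector of each fundamental type and ask R what it is. The object names here (lgl_vec, int_vec, etc.) are just illustrative.

```r
# One small vector of each fundamental type (names are illustrative)
lgl_vec <- c(TRUE, FALSE)
int_vec <- c(1L, 2L, 3L)   # the L suffix asks for an integer
dbl_vec <- c(1.2, 3.4)
chr_vec <- c("D", "S", "C")

typeof(lgl_vec) # "logical"
typeof(int_vec) # "integer"
typeof(dbl_vec) # "double"
typeof(chr_vec) # "character"

# Integers and doubles both count as numeric
is.numeric(int_vec) # TRUE
is.numeric(dbl_vec) # TRUE
is.numeric(chr_vec) # FALSE
```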

Big idea: Atomic vectors are homogeneous

All elements of a given atomic vector must be of the same type.
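
One way to see this rule in action is to mix types inside c(). R does not throw an error; it quietly converts every element to the most flexible type present. The vectors below are made-up examples.

```r
# Mixing types inside c() forces everything to a single type
mixed_chr <- c(1.5, "a", TRUE)
mixed_chr         # "1.5"  "a"    "TRUE"
typeof(mixed_chr) # "character"

mixed_dbl <- c(1L, 2.5)
typeof(mixed_dbl) # "double"

mixed_int <- c(TRUE, 2L)
mixed_int         # 1 2
typeof(mixed_int) # "integer"
```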

Converting types

  • We convert between types using functions like as.numeric(), as.logical(), etc.

  • Certain conversions are not possible (e.g., words to numbers), or give undesirable results (e.g., decimals to integers)
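
Here is a small sketch of both situations, using made-up values. Converting words to numbers produces NA (with a warning), and converting decimals to integers drops the decimal part rather than rounding.

```r
# Decimals to integers: the decimal part is dropped, not rounded
to_int <- as.integer(c(1.9, -1.9))
to_int # 1 -1

# Words to numbers: impossible, so R returns NA and warns us
# (suppressWarnings() is only here to keep the output tidy)
to_dbl <- suppressWarnings(as.numeric(c("3.14", "pi")))
to_dbl # 3.14 NA

# Numbers to logicals: zero is FALSE, everything else is TRUE
to_lgl <- as.logical(c(0, 1, 2))
to_lgl # FALSE TRUE TRUE

# Anything can become a character vector
to_chr <- as.character(c(1L, 2L))
to_chr # "1" "2"
```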

Working with logical vectors

  • Logical vectors are the result of Boolean algebra
  • We can use Boolean algebra in mutate() to create indicator variables
penguins |>
  drop_na(species, sex, body_mass_g) |>
  group_by(species, sex) |>
  mutate(overweight = body_mass_g > (mean(body_mass_g) + (sd(body_mass_g)))) |>
  slice_sample(n = 2) |> # two random obs from each group
  select(species, sex, body_mass_g, overweight)
# A tibble: 12 × 4
# Groups:   species, sex [6]
   species   sex    body_mass_g overweight
   <fct>     <fct>        <int> <lgl>     
 1 Adelie    female        3800 TRUE      
 2 Adelie    female        2925 FALSE     
 3 Adelie    male          4700 TRUE      
 4 Adelie    male          4300 FALSE     
 5 Chinstrap female        3700 FALSE     
 6 Chinstrap female        3300 FALSE     
 7 Chinstrap male          4050 FALSE     
 8 Chinstrap male          3950 FALSE     
 9 Gentoo    female        4300 FALSE     
10 Gentoo    female        3950 FALSE     
11 Gentoo    male          5700 FALSE     
12 Gentoo    male          5000 FALSE     
  • The functions any() and all() can be used to summarize a logical vector into a logical scalar
  • We can use these functions in summarize() to summarize indicator variables
penguins |>
  drop_na(species, sex, body_mass_g) |>
  group_by(species, sex) |>
  mutate(obese = body_mass_g > (mean(body_mass_g) + (2 * sd(body_mass_g)))) |>
  summarize(obese = any(obese))
# A tibble: 6 × 3
# Groups:   species [3]
  species   sex    obese
  <fct>     <fct>  <lgl>
1 Adelie    female FALSE
2 Adelie    male   TRUE 
3 Chinstrap female TRUE 
4 Chinstrap male   TRUE 
5 Gentoo    female FALSE
6 Gentoo    male   TRUE 
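
To see what any() and all() do outside of a pipeline, here is a minimal sketch on a made-up logical vector. The sum() and mean() lines are an extra not covered above: because TRUE counts as 1 and FALSE as 0, they give the number and proportion of TRUE values.

```r
# A made-up indicator variable
overweight <- c(FALSE, TRUE, FALSE, FALSE)

any(overweight) # TRUE: at least one element is TRUE
all(overweight) # FALSE: not every element is TRUE

# TRUE is treated as 1 and FALSE as 0 in arithmetic
sum(overweight)  # 1 (how many are TRUE)
mean(overweight) # 0.25 (what proportion are TRUE)
```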

In-class exercises

  1. Create a new indicator variable in nycflights13::flights called “red_eye” that is TRUE when the flight departs between 22:00 and 23:59, and arrives no later than 05:00.
  2. Did any() of the NYC airports have red eye flights on the 4th of July? Make sure to drop_na() from the dep_time and arr_time columns first.
# exercise 1
flights |>
  mutate(red_eye = between(dep_time, 2200, 2359) & arr_time <= 500,
         .keep = "used")

# exercise 2
flights |>
  drop_na(dep_time, arr_time) |>
  mutate(red_eye = between(dep_time, 2200, 2359) & arr_time <= 500) |>
  filter(month == 7, day == 4) |>
  group_by(origin) |>
  summarize(any_red_eyes = any(red_eye))

Working with numeric vectors

Note: We’ve been here before

We’ve already discussed several ways in which we can mutate() and summarize() numbers. Here are even more cool things we can do!

  • We can reduce the precision of doubles using the round() function
  • We can calculate cumulative sums and means using cumsum() and cummean()
  • Cumulative arithmetic can be easily applied within mutate()
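
The three bullets above can be sketched as follows. The values and column names are made up for illustration; note that cummean() comes from {dplyr}, while round() and cumsum() are base R.

```r
library(dplyr) # for cummean(), mutate(), and tibble()

# Reducing precision with round()
round(3.14159, digits = 2) # 3.14
round(1234, digits = -2)   # 1200 (negative digits round to tens, hundreds, ...)

# Cumulative arithmetic on a plain vector
x <- c(2, 4, 6)
cumsum(x)  # 2  6 12
cummean(x) # 2  3  4

# The same functions inside mutate()
running <- tibble(minutes = x) |>
  mutate(running_total = cumsum(minutes),
         running_mean = cummean(minutes))
running
```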

  • Using the quantile() function we can find values at or below which a given percentage of the data falls
  • The quantile() function works well within summarize() to mark important positions in the distribution of a given variable
penguins |>
  drop_na(species, sex) |>
  group_by(species, sex) |>
  summarize(flipper_0.1 = quantile(flipper_length_mm,
                                   probs = 0.1,
                                   na.rm = TRUE))
# A tibble: 6 × 3
# Groups:   species [3]
  species   sex    flipper_0.1
  <fct>     <fct>        <dbl>
1 Adelie    female        181 
2 Adelie    male          184 
3 Chinstrap female        186.
4 Chinstrap male          193 
5 Gentoo    female        208 
6 Gentoo    male          215 

In-class exercises

  1. Calculate the cumulative sum of departure delay minutes for December 25th. Before summing, drop_na() from the departure delay column and make sure the rows are in chronological order!
  2. Round the cumulative sum to the nearest hundred minutes (the digits= argument can be negative).
  3. What is the 0.99 quantile of the distance variable? That is, what is the value of distance below which 99% of the flights fall?
# exercise 1
flights |>
  filter(month == 12, day == 25) |>
  drop_na(dep_delay) |>
  arrange(dep_time) |>
  mutate(cumul_dep_delay = cumsum(dep_delay))

# exercise 2
flights |>
  filter(month == 12, day == 25) |>
  drop_na(dep_delay) |>
  arrange(dep_time) |>
  mutate(cumul_dep_delay = cumsum(dep_delay),
         cumul_dep_delay = round(cumul_dep_delay, digits = -2))

# exercise 3
flights |>
  summarize(distance_0.99 = quantile(distance, probs = 0.99))

Working with character vectors

Tip: The {stringr} package

The {stringr} package (included in {tidyverse}) makes working with character vectors easy!

  • The fruit character vector is a built-in data object from {stringr}
fruit[1:10]
 [1] "apple"        "apricot"      "avocado"      "banana"       "bell pepper" 
 [6] "bilberry"     "blackberry"   "blackcurrant" "blood orange" "blueberry"   
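
As a small taste of what {stringr} offers, here are a few of its functions applied to the first five fruits. This is only an illustrative selection; the object name first_fruits is made up.

```r
library(stringr)

first_fruits <- fruit[1:5]
first_fruits # "apple" "apricot" "avocado" "banana" "bell pepper"

# How many characters are in each string?
str_length(first_fruits) # 5 7 7 6 11

# Which strings contain the pattern "an"?
str_detect(first_fruits, "an") # FALSE FALSE FALSE TRUE FALSE

# Extract the first three characters of each string
str_sub(first_fruits, 1, 3) # "app" "apr" "avo" "ban" "bel"

# Convert to upper case
str_to_upper(first_fruits)
```

Notice that str_detect() returns a logical vector, so it can be used inside filter() or mutate() just like the Boolean expressions we wrote earlier.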

Tip: The {stringr} cheatsheet

We’ve only scratched the surface of what the {stringr} package can help us to do! If you want to read more, check out the {stringr} cheatsheet.

More complex atomic vectors

  • In R, categorical variables are stored as factors

  • Calendar dates are stored as Date objects

  • Specific moments in time (date-times) are stored as POSIXct objects

  • Each is built “on top of” one of the fundamental atomic types
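
We can peek under the hood with a quick sketch: typeof() reveals the fundamental type an object is built on, while class() reveals what R treats it as. The example values are made up.

```r
# A factor is an integer vector with labels attached
fct <- factor(c("dog", "cat", "dog"))
typeof(fct)     # "integer"
class(fct)      # "factor"
as.integer(fct) # 2 1 2 (the levels are "cat", then "dog")

# A Date is a double vector
d <- as.Date("2024-10-17")
typeof(d) # "double"
class(d)  # "Date"

# A date-time (POSIXct) is also a double vector
dt <- as.POSIXct("2024-10-17 12:00:00", tz = "UTC")
typeof(dt) # "double"
class(dt)  # "POSIXct" "POSIXt"
```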

Creating factors

  • We create factors using the factor() function

  • By default, factor levels are listed in ascending order (either alphabetically or numerically)

  • Ordinals require proper level ordering through the levels= argument
chr_vec <- c("good", "okay", "okay", "very good", "great")
factor(chr_vec,
       levels = c("okay", "good", "very good", "great"),
       ordered = TRUE)
[1] good      okay      okay      very good great    
Levels: okay < good < very good < great
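
To see why the levels= argument matters, compare the default (alphabetical) levels with the ordered version. This sketch reuses the same ratings; the object names are illustrative.

```r
ratings <- c("good", "okay", "okay", "very good", "great")

# Default: levels are sorted alphabetically, which is the wrong order here
default_fct <- factor(ratings)
levels(default_fct) # "good" "great" "okay" "very good"

# Ordinal: we supply the level order ourselves
ordered_fct <- factor(ratings,
                      levels = c("okay", "good", "very good", "great"),
                      ordered = TRUE)

# Ordered factors understand comparisons between levels
below_great <- ordered_fct < "great"
below_great # TRUE TRUE TRUE TRUE FALSE
```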

Creating dates and date-times

Tip: The {lubridate} package

The {lubridate} package (included in {tidyverse}) brings order to the chaos (e.g., timezones, leap days, daylight saving time, etc.) that swirls around humankind’s time-keeping systems.

  • The make_date() and make_datetime() functions work well when we have time component variables in our dataset
storms |>
  mutate(date = make_date(year = year, month = month, day = day),
         date_time = make_datetime(year = year, month = month,
                                   day = day, hour = hour),
         .keep = "used")
# A tibble: 19,537 × 6
    year month   day  hour date       date_time          
   <dbl> <dbl> <int> <dbl> <date>     <dttm>             
 1  1975     6    27     0 1975-06-27 1975-06-27 00:00:00
 2  1975     6    27     6 1975-06-27 1975-06-27 06:00:00
 3  1975     6    27    12 1975-06-27 1975-06-27 12:00:00
 4  1975     6    27    18 1975-06-27 1975-06-27 18:00:00
 5  1975     6    28     0 1975-06-28 1975-06-28 00:00:00
 6  1975     6    28     6 1975-06-28 1975-06-28 06:00:00
 7  1975     6    28    12 1975-06-28 1975-06-28 12:00:00
 8  1975     6    28    18 1975-06-28 1975-06-28 18:00:00
 9  1975     6    29     0 1975-06-29 1975-06-29 00:00:00
10  1975     6    29     6 1975-06-29 1975-06-29 06:00:00
# ℹ 19,527 more rows
  • There are many different variations of mdy() (e.g., ymd()), including those that can parse strings with both date and time information
sightings |>
  mutate(date_time = mdy_hm(datetime)) |>
  select(datetime, date_time, state, shape)
# A tibble: 80,332 × 4
   datetime         date_time           state shape   
   <chr>            <dttm>              <chr> <chr>   
 1 10/10/1949 20:30 1949-10-10 20:30:00 tx    cylinder
 2 10/10/1949 21:00 1949-10-10 21:00:00 tx    light   
 3 10/10/1955 17:00 1955-10-10 17:00:00 <NA>  circle  
 4 10/10/1956 21:00 1956-10-10 21:00:00 tx    circle  
 5 10/10/1960 20:00 1960-10-10 20:00:00 hi    light   
 6 10/10/1961 19:00 1961-10-10 19:00:00 tn    sphere  
 7 10/10/1965 21:00 1965-10-10 21:00:00 <NA>  circle  
 8 10/10/1965 23:45 1965-10-10 23:45:00 ct    disk    
 9 10/10/1966 20:00 1966-10-10 20:00:00 al    disk    
10 10/10/1966 21:00 1966-10-10 21:00:00 fl    disk    
# ℹ 80,322 more rows
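
The parsing functions are easiest to understand on single strings. In this sketch the function name simply spells out the order of the components in the string (m = month, d = day, y = year, h = hour, m = minute). The input strings are made up, apart from the first sighting above.

```r
library(lubridate)

# The same calendar date written three different ways
d1 <- mdy("10/17/24")
d2 <- ymd("2024-10-17")
d3 <- dmy("17/10/2024")
d1 # "2024-10-17"

# Strings with both date and time information
dt1 <- mdy_hm("10/10/1949 20:30")
dt1 # "1949-10-10 20:30:00 UTC"

# Building a date from separate components
d4 <- make_date(year = 1975, month = 6, day = 27)
d4 # "1975-06-27"
```

Unless told otherwise, the date-time parsers assume the UTC time zone.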

Lists

  • Unlike atomic vectors, lists do not need to be homogeneous

  • Unlike atomic vectors, the elements of a list aren’t values (scalars), but entire objects

  • We index lists a bit differently than we do vectors

  • Let’s create a list that contains a heterogeneous collection of elements

my_list <- list(c(1.23, 2.13, 3.12), "Sam", c(TRUE, FALSE, FALSE, TRUE, FALSE))
  • This list holds three objects: (1) a double vector, (2) a character vector with one value, and (3) a logical vector

  • I can access the third object using double-bracket indexing

  • If your list has element names, R also allows you to access its objects using the dollar sign operator
my_list <- list(obj1 = c(1.23, 2.13, 3.12),
                obj2 = "Sam",
                obj3 = c(TRUE, FALSE, FALSE, TRUE, FALSE))
my_list$obj2
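
Here is a short sketch of the indexing options side by side. The key difference is that double brackets return the object itself, while single brackets return a smaller list that still wraps the object.

```r
my_list <- list(obj1 = c(1.23, 2.13, 3.12),
                obj2 = "Sam",
                obj3 = c(TRUE, FALSE, FALSE, TRUE, FALSE))

# Double brackets pull out the object itself
my_list[[3]]         # TRUE FALSE FALSE TRUE FALSE
typeof(my_list[[3]]) # "logical"

# Single brackets return a list of length one
typeof(my_list[3])   # "list"
length(my_list[3])   # 1

# With names, these two are equivalent
my_list[["obj2"]] # "Sam"
my_list$obj2      # "Sam"
```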

Data frames are lists of vectors

  • In R, datasets are stored as data frame objects

  • Let’s look at one of R’s built-in data frame objects, mtcars

head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
  • Data frames are themselves lists
typeof(mtcars)
[1] "list"
  • Each column is an atomic vector object within the list

  • Column names are the element names of the list

typeof(mtcars$mpg)
[1] "double"
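
Because a data frame is a list, the list tools from the previous section work on it too. A brief sketch using mtcars:

```r
# A data frame passes the list test
is.list(mtcars) # TRUE

# The length of the list is the number of columns
length(mtcars) # 11

# Element names are the column names
names(mtcars)[1:3] # "mpg" "cyl" "disp"

# Double-bracket and dollar sign indexing both return the column vector
identical(mtcars[["mpg"]], mtcars$mpg) # TRUE
```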

Tibbles are better data frames

  • Tibbles improve on data frames in many ways, but the most obvious benefit to us is in printing

  • For a more thorough exploration of the differences between data frames and tibbles, call vignette("tibble")

  • We can use as_tibble() to convert a base R data frame to a tibble

Creating tibbles from…

as_tibble(iris) # iris is a built-in data frame
# A tibble: 150 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# ℹ 140 more rows
tibble(int_var = c(1L, 2L, 3L),
       chr_var = c("red", "blue", "green"),
       date_var = mdy("10/17/24", "11/5/24", "11/27/24"),
       fct_var = factor(c("dog", "cat", "dog")),
       dbl_var = c(3.14, 2.71, 6.02))
# A tibble: 3 × 5
  int_var chr_var date_var   fct_var dbl_var
    <int> <chr>   <date>     <fct>     <dbl>
1       1 red     2024-10-17 dog        3.14
2       2 blue    2024-11-05 cat        2.71
3       3 green   2024-11-27 dog        6.02
tribble(
  ~int_var, ~chr_var, ~lgl_var,
  1, "red", TRUE,
  2, "blue", FALSE,
  3, "green", TRUE
)
# A tibble: 3 × 3
  int_var chr_var lgl_var
    <dbl> <chr>   <lgl>  
1       1 red     TRUE   
2       2 blue    FALSE  
3       3 green   TRUE   
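
Notice that int_var printed as <dbl> above: a bare 1 is a double in R. If we really want an integer column, one option is to add the L suffix, as in this sketch (the object name int_tbl is made up).

```r
library(tibble)

# The L suffix makes each value an integer rather than a double
int_tbl <- tribble(
  ~int_var, ~chr_var,
  1L, "red",
  2L, "blue",
  3L, "green"
)

typeof(int_tbl$int_var) # "integer"
```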
sightings <- read_csv("08_data/sightings.csv")

Note: Data import

We’ll talk more about importing data from files in a future lecture.

Visual summary

* Note: R supports more than just atomic vector and list objects