Functionals

Sam Mason

Learning goals

Following this lecture, students should be able to:

  1. Understand how the map() functional (and its atomic variants) works
  2. Apply the map() functional (and its atomic variants) to accomplish iteration tasks
  3. Understand the difference between map() and walk()

Packages

library(tidyverse)
library(nycflights23) # not nycflights13

Review: mutating and summarizing functions

  • Mutating functions take a vector and return a vector of the same length

  • Summarizing functions take a vector as input and return a single value as output

  • We can use the across() function (itself a functional, btw) to apply these functions to multiple columns “at once”
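As a quick refresher, the sketch below applies one mutating and one summarizing function to two columns with across(). The tibble df and its columns a and b are made up for illustration and are not part of the lecture data.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data (not from the lecture)
df <- tibble(a = c(1, 2, 3), b = c(4, 5, 6))

# Mutating: each column is centered, so the number of rows is unchanged
centered <- df |>
  mutate(across(c(a, b), \(v) v - mean(v)))

# Summarizing: each column collapses to a single value
col_means <- df |>
  summarize(across(c(a, b), mean))
```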

The map() functional

Big idea: the map() function

map() is the tidyverse’s Platonic, general-purpose functional. It takes a vector (atomic or list) and a function, and applies the function to each element of the vector, returning a list with the same number of elements.

  • Let’s see map() in action!
x <- 1:10
l <- map(.x = x, # the input vector to apply the function to
         .f = \(x) if_else(x %% 2 == 0, "even", "odd")) # the function
length(l) == length(x) # same length as x
[1] TRUE
l[1:3] # each element is a character vector with one element
[[1]]
[1] "odd"

[[2]]
[1] "even"

[[3]]
[1] "odd"
  • In general, the map() functional replaces for() loops which iterate through the elements of a vector (or list) and populate some output list
x <- 1:10
l <- vector("list", length(x)) # preallocated output list
for (i in seq_along(x)) {
  if (x[i] %% 2 == 0) {
    l[[i]] <- "even" # [[ ]] assigns a single list element
  } else {
    l[[i]] <- "odd"
  }
}
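To check that the two approaches really are interchangeable, here is a minimal sketch that builds the same list both ways and compares the results. It uses base R's if/else rather than if_else() to keep the dependencies small; the names from_map and from_loop are illustrative.

```r
library(purrr)

x <- 1:10

# Functional version: map() creates and fills the output list for us
from_map <- map(x, \(v) if (v %% 2 == 0) "even" else "odd")

# Loop version: we create and fill the output list ourselves
from_loop <- vector("list", length(x))
for (i in seq_along(x)) {
  from_loop[[i]] <- if (x[i] %% 2 == 0) "even" else "odd"
}

identical(from_map, from_loop) # both are lists of ten one-element character vectors
```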
  • The map() functional has a ... argument that we can use to pass additional arguments to the function supplied to .f=.
x <- c("10-15-24", "10-16-24", "10-17-24", "10-18-24", "10-19-24")
map(.x = x,
    .f = mdy,
    tz = "US/Eastern") # tz= is an argument within mdy()
[[1]]
[1] "2024-10-15 EDT"

[[2]]
[1] "2024-10-16 EDT"

[[3]]
[1] "2024-10-17 EDT"

[[4]]
[1] "2024-10-18 EDT"

[[5]]
[1] "2024-10-19 EDT"
  • It is, however, best practice to specify the subordinate function's arguments by way of an anonymous function
x <- c("10-15-24", "10-16-24", "10-17-24", "10-18-24", "10-19-24")
map(.x = x,
    .f = \(x) mdy(x, tz = 'US/Eastern'))
[[1]]
[1] "2024-10-15 EDT"

[[2]]
[1] "2024-10-16 EDT"

[[3]]
[1] "2024-10-17 EDT"

[[4]]
[1] "2024-10-18 EDT"

[[5]]
[1] "2024-10-19 EDT"
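Both styles produce the same result; the anonymous function simply makes it visible which function receives the extra argument. A minimal sketch, using base R's round() as a stand-in for mdy() (the object names are illustrative):

```r
library(purrr)

x <- c(1.234, 5.678)

# digits= travels through map()'s ... argument
via_dots <- map(x, round, digits = 1)

# digits= is visibly an argument of round()
via_lambda <- map(x, \(v) round(v, digits = 1))

identical(via_dots, via_lambda)
```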

map() within mutate()

  • map() is effectively a mutating function

Vectorized function

mpg |>
  mutate(year = factor(year)) |>
  select(model:year) # for display purposes
# A tibble: 234 × 3
   model      displ year 
   <chr>      <dbl> <fct>
 1 a4           1.8 1999 
 2 a4           1.8 1999 
 3 a4           2   2008 
 4 a4           2   2008 
 5 a4           2.8 1999 
 6 a4           2.8 1999 
 7 a4           3.1 2008 
 8 a4 quattro   1.8 1999 
 9 a4 quattro   1.8 1999 
10 a4 quattro   2   2008 
# ℹ 224 more rows

map()

mpg |>
  mutate(year = map(.x = year,
                    .f = factor)) |>
  select(model:year) # for display purposes
# A tibble: 234 × 3
   model      displ year     
   <chr>      <dbl> <list>   
 1 a4           1.8 <fct [1]>
 2 a4           1.8 <fct [1]>
 3 a4           2   <fct [1]>
 4 a4           2   <fct [1]>
 5 a4           2.8 <fct [1]>
 6 a4           2.8 <fct [1]>
 7 a4           3.1 <fct [1]>
 8 a4 quattro   1.8 <fct [1]>
 9 a4 quattro   1.8 <fct [1]>
10 a4 quattro   2   <fct [1]>
# ℹ 224 more rows

Big idea: list columns and hierarchy

When we introduced data frames and tibbles, we described them as lists of atomic vectors, each atomic vector being a column. These data structures also support list columns! This is our first taste of hierarchical datasets, a concept that we will get more practice with when we learn to import JSON data using web APIs.

  • The map() functional always returns a list

  • This behavior ensures that the output is always the same length as the input, even when .f returns a different number of values for each element

sample_sizes <- c(2, 4, 6)
samples <- map(.x = sample_sizes,
               .f = \(x) runif(x, min = 0, max = 1))
samples
[[1]]
[1] 0.931081766 0.002136413

[[2]]
[1] 0.9211738 0.5341741 0.3408999 0.1276429

[[3]]
[1] 0.69030450 0.82069657 0.78606399 0.69244715 0.03834361 0.85531494
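One way to see the "same length as the input" guarantee is to compare the length of the list with the lengths of its elements. The sketch below repeats the sampling step and then stores the result as a list column; the tibble and column names are illustrative.

```r
library(purrr)
library(tibble)

sample_sizes <- c(2, 4, 6)
samples <- map(sample_sizes, \(n) runif(n, min = 0, max = 1))

length(samples)  # one list element per input element
lengths(samples) # the elements themselves differ in length

# Because the list has one element per row, it fits in a tibble as a list column
sims <- tibble(size = sample_sizes, draws = samples)
```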
  • In situations where each element of a list column contains a single value, we can use purrr::list_c() to concatenate these values into an atomic vector
mpg |>
  mutate(year = map(.x = year,
                    .f = factor),
         year = list_c(year))
# A tibble: 234 × 11
   manufacturer model displ year    cyl trans drv     cty   hwy fl    class
   <chr>        <chr> <dbl> <fct> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4      1.8 1999      4 auto… f        18    29 p     comp…
 2 audi         a4      1.8 1999      4 manu… f        21    29 p     comp…
 3 audi         a4      2   2008      4 manu… f        20    31 p     comp…
 4 audi         a4      2   2008      4 auto… f        21    30 p     comp…
 5 audi         a4      2.8 1999      6 auto… f        16    26 p     comp…
 6 audi         a4      2.8 1999      6 manu… f        18    26 p     comp…
 7 audi         a4      3.1 2008      6 auto… f        18    27 p     comp…
 8 audi         a4 q…   1.8 1999      4 manu… 4        18    26 p     comp…
 9 audi         a4 q…   1.8 1999      4 auto… 4        16    25 p     comp…
10 audi         a4 q…   2   2008      4 manu… 4        20    28 p     comp…
# ℹ 224 more rows
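Outside of a data frame, the same idea looks like this. A minimal sketch with a made-up list of single values:

```r
library(purrr)

# Hypothetical list in which every element holds exactly one value
single_values <- list("a", "b", "c")

flat <- list_c(single_values)
is.list(flat) # FALSE: the result is an atomic character vector
```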

Mapping with atomic output

  • map() variants including map_lgl(), map_dbl(), map_int(), and map_chr() output corresponding atomic vectors

  • There is no map_fct(), but map_vec() combines the results into a single vector of their common type, so it also works for factors, dates, and date-times

mpg |>
  mutate(year = map_vec(.x = year,
                        .f = factor))
# A tibble: 234 × 11
   manufacturer model displ year    cyl trans drv     cty   hwy fl    class
   <chr>        <chr> <dbl> <fct> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4      1.8 1999      4 auto… f        18    29 p     comp…
 2 audi         a4      1.8 1999      4 manu… f        21    29 p     comp…
 3 audi         a4      2   2008      4 manu… f        20    31 p     comp…
 4 audi         a4      2   2008      4 auto… f        21    30 p     comp…
 5 audi         a4      2.8 1999      6 auto… f        16    26 p     comp…
 6 audi         a4      2.8 1999      6 manu… f        18    26 p     comp…
 7 audi         a4      3.1 2008      6 auto… f        18    27 p     comp…
 8 audi         a4 q…   1.8 1999      4 manu… 4        18    26 p     comp…
 9 audi         a4 q…   1.8 1999      4 auto… 4        16    25 p     comp…
10 audi         a4 q…   2   2008      4 manu… 4        20    28 p     comp…
# ℹ 224 more rows
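The typed variants can be compared side by side on a small input. This is a sketch with made-up object names; each call returns a vector of the type named in its suffix, and map_vec() is shown with dates as one example of a non-atomic vector type.

```r
library(purrr)

x <- 1:4

halves  <- map_dbl(x, \(v) v / 2)       # double vector
doubled <- map_int(x, \(v) v * 2L)      # integer vector
is_even <- map_lgl(x, \(v) v %% 2 == 0) # logical vector
as_text <- map_chr(x, \(v) letters[v])  # character vector

# map_vec() keeps richer vector types such as dates
dates <- map_vec(c("2024-10-15", "2024-10-16"), as.Date)
class(dates)
```

If the function returns something that does not fit the requested type (for example, a character string inside map_dbl()), the call fails with an error rather than silently converting.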

In-class exercises

The following exercises use the nycflights23::flights dataset.

  1. Use map_dbl() to convert all values of distance from miles to feet. There are 5,280 feet in one mile.
  2. Use map_chr() to make all the letters in tailnum lowercase using str_to_lower().
# exercise 1
flights |>
  mutate(distance_ft = map_dbl(.x = distance,
                               .f = \(x) x * 5280))

# exercise 2
flights |>
  mutate(tailnum = map_chr(.x = tailnum,
                           .f = str_to_lower))

Why use map() at all?

Motivating question

In a language like R, where we get so much iteration “for free” (through vectorized functions), why would we need to use map() and its variants at all?

  1. Progress bars for large datasets and/or complex operations
  2. Parallelization of operations (using {furrr}; not discussed in this lecture)
  3. Importing multiple datasets from files
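The premise of the question can be checked directly: for a simple arithmetic task, the vectorized expression and map_dbl() return the same vector. A minimal sketch with made-up distances (not taken from the flights data):

```r
library(purrr)

# Hypothetical distances in miles
distance <- c(200, 1400, 2475)

with_map   <- map_dbl(distance, \(d) d * 5280)
vectorized <- distance * 5280

identical(with_map, vectorized)
```

For tasks like this one, the vectorized form is shorter and there is little reason to reach for map(); the cases listed above are where it earns its keep.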

Progress bars for complex operations

Note: this example is somewhat contrived

To illustrate the utility of the progress bar, I’ve concocted a scenario that you will likely never find useful in your data science careers (unless you become an introductory statistics professor). Time-consuming operations like this are common, but they relate to tasks that we haven’t gotten to yet (e.g., scraping columns of URLs, querying APIs based on columns of search parameters) or won’t ever cover in this class (e.g., sensitivity analysis in probabilistic modeling, bootstrapping).

  • Copy and paste the code below into an RStudio session with the {tidyverse} loaded in, then run it.
tibble(size = 1:50000) |>
  mutate(sample_mean = map_dbl(
    .x = size,
    .f = \(x) mean(rnorm(x, mean = 0, sd = 1)),
    .progress = TRUE)) |>
  ggplot(aes(x = size,
             y = sample_mean)) +
  geom_line(linewidth = 0.1) +
  geom_hline(yintercept = 0,
             color = "gray",
             linetype = 2) +
  labs(title = "The law of large numbers",
       x = "Sample size",
       y = "Sample mean")

Importing many files

Motivating task

I have six .csv files that report data on organizational leadership graduate degree conferrals where each file corresponds to a different academic year (2018 to 2023). I’d like to import them all, and then combine them into a single dataset.

list.files(path = "./13_data")
 [1] "org_lead_2018.csv"  "org_lead_2018.html" "org_lead_2019.csv" 
 [4] "org_lead_2019.html" "org_lead_2020.csv"  "org_lead_2020.html"
 [7] "org_lead_2021.csv"  "org_lead_2021.html" "org_lead_2022.csv" 
[10] "org_lead_2022.html" "org_lead_2023.csv"  "org_lead_2023.html"
files <- list.files(path = "./13_data",
                    pattern = "\\.csv$") # ending in .csv
  • The read_csv() function is vectorized: it can accept a vector of file paths

  • It will automatically attempt to combine the files by row

read_csv(file = str_c("./13_data/", files))
Error: Files must have consistent column names:
* File 1 column 4 is: C2018_A.First or Second Major
* File 2 column 4 is: C2019_A.First or Second Major
  • We get an error because the column names do not match: some columns carry a year-specific prefix (C2018_A. in the first file, C2019_A. in the second)

  • Instead of using the vectorized read_csv(), let’s solve this problem with map()

dfs <- map(.x = str_c("./13_data/", files),
           .f = read_csv)
dfs[[1]] |> select(1:3) # for display purposes
# A tibble: 136 × 3
   unitid `institution name`              year
    <dbl> <chr>                          <dbl>
 1 100690 Amridge University              2018
 2 102669 Alaska Pacific University       2018
 3 107141 John Brown University           2018
 4 110361 California Baptist University   2018
 5 112075 Concordia University-Irvine     2018
 6 119173 Mount Saint Mary's University   2018
 7 119605 National University             2018
 8 121150 Pepperdine University           2018
 9 121309 Point Loma Nazarene University  2018
10 121691 University of Redlands          2018
# ℹ 126 more rows
  • I can now iterate through each tibble in the dfs list to standardize the column names

  • First, I’ll make a function to help standardize names

standardize_names <- function(df) {
  colnames(df) <- str_remove(string = colnames(df),
                             pattern = "^.+\\.")
  return(df) # return the data frame; the assignment above returns only the new names (invisibly)
}
  • Next I’ll apply this function to each element of dfs and assign the output to a new list
clean_dfs <- map(.x = dfs,
                 .f = standardize_names)
  • Finally, I’ll combine the tibbles by row
org_lead_confs <- list_rbind(clean_dfs)
org_lead_confs |>
  slice_sample(n = 10) |>
  select(1:3, `Grand total`)
# A tibble: 10 × 4
   unitid `institution name`                         year `Grand total`
    <dbl> <chr>                                     <dbl>         <dbl>
 1 178615 Truman State University                    2021             6
 2 149514 Trinity International University-Illinois  2021            10
 3 179159 Saint Louis University                     2018            14
 4 480569 Florida Institute of Technology-Online     2022            20
 5 136330 Palm Beach Atlantic University             2020            32
 6 489937 Carolina University                        2022             8
 7 228787 The University of Texas at Dallas          2023            34
 8 127918 Regis University                           2019            75
 9 152336 University of Saint Francis-Fort Wayne     2021            22
10 162928 Johns Hopkins University                   2021             0
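Since the ./13_data folder is not available outside this lecture, the whole read, clean, and combine workflow can be rehearsed on throwaway files. The sketch below writes two tiny CSV files to a temporary directory and then follows the same three steps. The file names, column values, and the demo_ prefix are all invented for illustration; only the year-prefixed column naming pattern mirrors the real data.

```r
library(purrr)
library(readr)
library(stringr)
library(tibble)

# Hypothetical scratch directory with two made-up files
demo_dir <- tempfile("org_lead_demo_")
dir.create(demo_dir)
write_csv(tibble(unitid = 1:2, year = 2018, `C2018_A.Grand total` = c(5, 7)),
          file.path(demo_dir, "demo_2018.csv"))
write_csv(tibble(unitid = 1:2, year = 2019, `C2019_A.Grand total` = c(6, 9)),
          file.path(demo_dir, "demo_2019.csv"))

# full.names = TRUE returns complete paths, so no str_c() step is needed
paths <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Same helper as above: drop everything up to and including the last period
standardize_names <- function(df) {
  colnames(df) <- str_remove(colnames(df), "^.+\\.")
  df
}

combined <- paths |>
  map(\(p) read_csv(p, show_col_types = FALSE)) |> # one tibble per file
  map(standardize_names) |>                        # harmonize column names
  list_rbind()                                     # stack the tibbles by row

names(combined)
```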

The walk() functional

  • Functions in R tend to return some sort of output

  • Certain functions also have side effects: they do something other than return output, for example:

    • change variables in the environment

    • plot graphics

    • save data to delimited text files

  • The walk() functional is used when all we care about are the side effects of a function
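A minimal sketch of the idea, using plain text files in a temporary directory instead of plots. The directory, labels, and file names are made up for illustration.

```r
library(purrr)

# Hypothetical scratch directory for the output files
out_dir <- tempfile("walk_demo_")
dir.create(out_dir)

labels <- c("a", "b", "c")

# Side effect: one small text file is written per label
returned <- walk(labels,
                 \(lab) writeLines(lab, file.path(out_dir, paste0(lab, ".txt"))))

list.files(out_dir)         # the files now exist on disk
identical(returned, labels) # walk() hands back its input, not a list of results
```

Because walk() returns its input invisibly, nothing is printed to the console, and the call can sit in the middle of a pipeline without disturbing the data flowing through it.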

Saving multiple plots to file