Functionals

Sam Mason

Learning goals

Following this lecture, students should be able to:

Understand how the map() functional (and its atomic variants) works
Apply the map() functional (and its atomic variants) to accomplish iteration tasks
Understand the difference between map() and walk()

Packages

library(tidyverse)
library(nycflights23) # not nycflights13

The {tidyverse} metapackage includes a new-to-us package called {purrr}
{purrr} enhances R’s functional programming toolkit

Review: mutating and summarizing functions

Mutating functions take a vector and return a vector of the same length
Summarizing functions take a vector as input and return a single value as output
We can use the across() function (itself a functional, btw) to apply these functions to multiple columns “at once”
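
As a quick refresher, here is a minimal sketch of the distinction on a small made-up tibble (the columns a and b are placeholders, not part of the lecture data). sqrt() plays the role of a mutating function and mean() the role of a summarizing function, with across() applying each one to both columns.

```r
library(tidyverse)

# toy data; `a` and `b` are hypothetical columns used only for illustration
df <- tibble(a = c(1, 4, 9), b = c(16, 25, 36))

# mutating function: sqrt() returns one value per input value
roots <- df |>
  mutate(across(c(a, b), sqrt))

# summarizing function: mean() collapses each column to a single value
means <- df |>
  summarize(across(c(a, b), mean))

roots # same number of rows as df
means # a single row
```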

The `map()` functional

Big idea: the map() function

map() is the tidyverse’s Platonic, general-purpose functional. It takes a vector (atomic or list) and a function, and applies the function to each element of the vector, returning a list with the same number of elements.

Let’s see map() in action!

x <- 1:10
l <- map(.x = x, # the input vector to apply the function to
         .f = \(x) if_else(x %% 2 == 0, "even", "odd")) # the function
length(l) == length(x) # same length as x

[1] TRUE

l[1:3] # each element is a character vector with one element

[[1]]
[1] "odd"

[[2]]
[1] "even"

[[3]]
[1] "odd"

In general, the map() functional replaces for() loops which iterate through the elements of a vector (or list) and populate some output list

x <- 1:10
l <- list() # output list
for (i in seq_along(x)) {
  if (x[i] %% 2 == 0) {
    l[i] <- "even"
  } else {
    l[i] <- "odd"
  }
}
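
As a rough check that the two approaches agree, the sketch below runs both and compares the results. It preallocates the output list with vector(), which is a common refinement and not something the loop above requires; the names l_loop and l_map are made up for this comparison.

```r
library(tidyverse)

x <- 1:10

# for() loop version, with the output list preallocated to its final length
l_loop <- vector(mode = "list", length = length(x))
for (i in seq_along(x)) {
  l_loop[[i]] <- if (x[i] %% 2 == 0) "even" else "odd"
}

# map() version
l_map <- map(.x = x,
             .f = \(x) if_else(x %% 2 == 0, "even", "odd"))

# compare the two results element by element
all(unlist(l_loop) == unlist(l_map))
```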

The map() functional has a ... argument that we can use to pass additional arguments to the function supplied to .f=.

x <- c("10-15-24", "10-16-24", "10-17-24", "10-18-24", "10-19-24")
map(.x = x,
    .f = mdy,
    tz = "US/Eastern") # tz= is an argument within mdy()

[[1]]
[1] "2024-10-15 EDT"

[[2]]
[1] "2024-10-16 EDT"

[[3]]
[1] "2024-10-17 EDT"

[[4]]
[1] "2024-10-18 EDT"

[[5]]
[1] "2024-10-19 EDT"

It is, however, best practice to specify the subordinate function's arguments by way of an anonymous function, which makes it clear which arguments belong to which function

x <- c("10-15-24", "10-16-24", "10-17-24", "10-18-24", "10-19-24")
map(.x = x,
    .f = \(x) mdy(x, tz = 'US/Eastern'))

[[1]]
[1] "2024-10-15 EDT"

[[2]]
[1] "2024-10-16 EDT"

[[3]]
[1] "2024-10-17 EDT"

[[4]]
[1] "2024-10-18 EDT"

[[5]]
[1] "2024-10-19 EDT"

`map()` within `mutate()`

map() is effectively a mutating function: its output has one element per input element, so it can be used inside mutate()

Vectorized function

mpg |>
  mutate(year = factor(year)) |>
  select(model:year) # for display purposes

# A tibble: 234 × 3
   model      displ year 
   <chr>      <dbl> <fct>
 1 a4           1.8 1999 
 2 a4           1.8 1999 
 3 a4           2   2008 
 4 a4           2   2008 
 5 a4           2.8 1999 
 6 a4           2.8 1999 
 7 a4           3.1 2008 
 8 a4 quattro   1.8 1999 
 9 a4 quattro   1.8 1999 
10 a4 quattro   2   2008 
# ℹ 224 more rows

map()

mpg |>
  mutate(year = map(.x = year,
                    .f = factor)) |>
  select(model:year) # for display purposes

# A tibble: 234 × 3
   model      displ year     
   <chr>      <dbl> <list>   
 1 a4           1.8 <fct [1]>
 2 a4           1.8 <fct [1]>
 3 a4           2   <fct [1]>
 4 a4           2   <fct [1]>
 5 a4           2.8 <fct [1]>
 6 a4           2.8 <fct [1]>
 7 a4           3.1 <fct [1]>
 8 a4 quattro   1.8 <fct [1]>
 9 a4 quattro   1.8 <fct [1]>
10 a4 quattro   2   <fct [1]>
# ℹ 224 more rows

Big idea: list columns and hierarchy

When we introduced data frames and tibbles, we described them as lists of atomic vectors, each atomic vector being a column. These data structures also support list columns! This is our first taste of hierarchical datasets, a concept that we will get more practice with when we learn to import JSON data using web APIs.

The map() functional always returns a list
This behavior ensures that the output is always the same length as the input

sample_sizes <- c(2, 4, 6)
samples <- map(.x = sample_sizes,
               .f = \(x) runif(x, min = 0, max = 1))
samples

[[1]]
[1] 0.931081766 0.002136413

[[2]]
[1] 0.9211738 0.5341741 0.3408999 0.1276429

[[3]]
[1] 0.69030450 0.82069657 0.78606399 0.69244715 0.03834361 0.85531494
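
Because the output is a list, samples of different lengths can sit side by side in a tibble as a list column. The sketch below is one way to illustrate this; the names draws_tbl, draws, and n_draws are made up for the example.

```r
library(tidyverse)

draws_tbl <- tibble(size = c(2, 4, 6)) |>
  mutate(draws = map(.x = size,
                     .f = \(x) runif(x, min = 0, max = 1)), # list column
         n_draws = map_int(.x = draws,
                           .f = length)) # one integer per list element

draws_tbl
```

Each row of draws holds a numeric vector whose length matches size, which would not be possible in an atomic column.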

In situations where each element of a list column contains a single value, we can use purrr::list_c() to concatenate these values into an atomic vector

mpg |>
  mutate(year = map(.x = year,
                    .f = factor),
         year = list_c(year))

# A tibble: 234 × 11
   manufacturer model displ year    cyl trans drv     cty   hwy fl    class
   <chr>        <chr> <dbl> <fct> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4      1.8 1999      4 auto… f        18    29 p     comp…
 2 audi         a4      1.8 1999      4 manu… f        21    29 p     comp…
 3 audi         a4      2   2008      4 manu… f        20    31 p     comp…
 4 audi         a4      2   2008      4 auto… f        21    30 p     comp…
 5 audi         a4      2.8 1999      6 auto… f        16    26 p     comp…
 6 audi         a4      2.8 1999      6 manu… f        18    26 p     comp…
 7 audi         a4      3.1 2008      6 auto… f        18    27 p     comp…
 8 audi         a4 q…   1.8 1999      4 manu… 4        18    26 p     comp…
 9 audi         a4 q…   1.8 1999      4 auto… 4        16    25 p     comp…
10 audi         a4 q…   2   2008      4 manu… 4        20    28 p     comp…
# ℹ 224 more rows

Mapping with atomic output

map() variants including map_lgl(), map_dbl(), map_int(), and map_chr() output corresponding atomic vectors
There is no map_fct(), but map_vec() returns a vector of the common type of the results, so it also works for types such as factors, dates, and date-times

mpg |>
  mutate(year = map_vec(.x = year,
                        .f = factor))

# A tibble: 234 × 11
   manufacturer model displ year    cyl trans drv     cty   hwy fl    class
   <chr>        <chr> <dbl> <fct> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4      1.8 1999      4 auto… f        18    29 p     comp…
 2 audi         a4      1.8 1999      4 manu… f        21    29 p     comp…
 3 audi         a4      2   2008      4 manu… f        20    31 p     comp…
 4 audi         a4      2   2008      4 auto… f        21    30 p     comp…
 5 audi         a4      2.8 1999      6 auto… f        16    26 p     comp…
 6 audi         a4      2.8 1999      6 manu… f        18    26 p     comp…
 7 audi         a4      3.1 2008      6 auto… f        18    27 p     comp…
 8 audi         a4 q…   1.8 1999      4 manu… 4        18    26 p     comp…
 9 audi         a4 q…   1.8 1999      4 auto… 4        16    25 p     comp…
10 audi         a4 q…   2   2008      4 manu… 4        20    28 p     comp…
# ℹ 224 more rows
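
For reference, here is a small sketch of each atomic variant applied to the same toy input. The anonymous functions are arbitrary choices made for illustration.

```r
library(tidyverse)

x <- 1:5

lgl <- map_lgl(.x = x, .f = \(x) x %% 2 == 0)      # logical vector
int <- map_int(.x = x, .f = \(x) x * 2L)           # integer vector
dbl <- map_dbl(.x = x, .f = \(x) x / 2)            # double vector
chr <- map_chr(.x = x, .f = \(x) str_c("id_", x))  # character vector

# map_vec() handles types without a dedicated variant, such as dates
dates <- map_vec(.x = c("10-15-24", "10-16-24"), .f = mdy)
class(dates)
```

Each variant raises an error if .f returns a value that is not of length one or that cannot be safely converted to the target type, so mistakes tend to surface early.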

In-class exercises

The following exercises use the nycflights23::flights dataset.

Exercises

1. Use map_dbl() to convert all values of distance from miles to feet. There are 5,280 feet in one mile.
2. Use map_chr() to make all the letters in tailnum lowercase using str_to_lower().

Solutions

# exercise 1
flights |>
  mutate(distance_ft = map_dbl(.x = distance,
                               .f = \(x) x * 5280))

# exercise 2
flights |>
  mutate(tailnum = map_chr(.x = tailnum,
                           .f = str_to_lower))
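
Both operations are already vectorized in R, so the map_*() calls give the same results as applying the functions directly. The sketch below checks this on small made-up vectors standing in for the distance and tailnum columns (the values are invented, so that it runs without {nycflights23}).

```r
library(tidyverse)

# toy stand-ins for the distance and tailnum columns (values are made up)
distance <- c(760, 1416, 2475)
tailnum <- c("N101AA", "N202BB", "N303CC")

# mapped version vs. vectorized version
same_dbl <- all(map_dbl(.x = distance, .f = \(x) x * 5280) == distance * 5280)
same_chr <- all(map_chr(.x = tailnum, .f = str_to_lower) == str_to_lower(tailnum))

same_dbl
same_chr
```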

Why use `map()` at all?

Motivating question

In a language like R, where we get so much iteration “for free” (through vectorized functions), why would we need to use map() and its variants at all?

Progress bars for large datasets and/or complex operations
Parallelization of operations (using {furrr}; not discussed in this lecture)
Importing multiple datasets from files

Progress bars for complex operations

Note: this example is somewhat contrived

To illustrate the utility of the progress bar, I’ve concocted a scenario that you will likely never find useful in your data science careers (unless you become an introductory statistics professor). Time-consuming operations like this are common, but relate to tasks that we haven’t gotten to yet (e.g., scraping columns of URLs, querying APIs based on columns of search parameters), or won’t ever cover in this class (e.g., sensitivity analysis in probabilistic modeling, bootstrapping, etc.)

Copy and paste the code below into an RStudio session with the {tidyverse} loaded in, then run it.

tibble(size = 1:50000) |>
  mutate(sample_mean = map_dbl(
    .x = size,
    .f = \(x) mean(rnorm(x, mean = 0, sd = 1)),
    .progress = TRUE)) |>
  ggplot(aes(x = size,
             y = sample_mean)) +
  geom_line(linewidth = 0.1) +
  geom_hline(yintercept = 0,
             color = "gray",
             linetype = 2) +
  labs(title = "The law of large numbers",
       x = "Sample size",
       y = "Sample mean")

Importing many files

Motivating task

I have six .csv files that report data on organizational leadership graduate degree conferrals where each file corresponds to a different academic year (2018 to 2023). I’d like to import them all, and then combine them into a single dataset.

list.files(path = "./13_data")

 [1] "org_lead_2018.csv"  "org_lead_2018.html" "org_lead_2019.csv" 
 [4] "org_lead_2019.html" "org_lead_2020.csv"  "org_lead_2020.html"
 [7] "org_lead_2021.csv"  "org_lead_2021.html" "org_lead_2022.csv" 
[10] "org_lead_2022.html" "org_lead_2023.csv"  "org_lead_2023.html"

files <- list.files(path = "./13_data",
                    pattern = "\\.csv$") # ending in .csv

The read_csv() function is vectorized over file paths: it can accept a character vector of them
It will automatically attempt to combine the files by row

read_csv(file = str_c("./13_data/", files))

Error: Files must have consistent column names:
* File 1 column 4 is: C2018_A.First or Second Major
* File 2 column 4 is: C2019_A.First or Second Major

We get an error because the column names do not match
Instead of using the vectorized read_csv(), let’s solve this problem with map()

dfs <- map(.x = str_c("./13_data/", files),
           .f = read_csv)
dfs[[1]] |> select(1:3) # for display purposes

# A tibble: 136 × 3
   unitid `institution name`              year
    <dbl> <chr>                          <dbl>
 1 100690 Amridge University              2018
 2 102669 Alaska Pacific University       2018
 3 107141 John Brown University           2018
 4 110361 California Baptist University   2018
 5 112075 Concordia University-Irvine     2018
 6 119173 Mount Saint Mary's University   2018
 7 119605 National University             2018
 8 121150 Pepperdine University           2018
 9 121309 Point Loma Nazarene University  2018
10 121691 University of Redlands          2018
# ℹ 126 more rows

I can now iterate through each tibble in the dfs list to standardize the column names
First, I’ll make a function to help standardize names

standardize_names <- function(df) {
  colnames(df) <- str_remove(string = colnames(df),
                             pattern = "^.+\\.")
  return(df) # return the renamed data frame rather than the value of the assignment
}
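
To see what the regular expression does, the helper can be tried on a made-up one-row tibble. The prefixed column names below imitate the pattern in the error message and are assumptions, not the files' actual headers.

```r
library(tidyverse)

# copy of the helper defined above, repeated so this sketch runs on its own
standardize_names <- function(df) {
  colnames(df) <- str_remove(string = colnames(df),
                             pattern = "^.+\\.")
  return(df)
}

# hypothetical tibble with year-specific prefixes on some columns
toy <- tibble(unitid = 100690,
              `C2018_A.First or Second Major` = 1,
              `C2018_A.Grand total` = 12)

new_names <- colnames(standardize_names(toy))
new_names
```

The pattern removes everything up to and including the last period in a name, so columns without a period (like unitid) are left unchanged.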

Next I’ll apply this function to each element of dfs and assign the output to a new list

clean_dfs <- map(.x = dfs,
                 .f = standardize_names)

Finally, I’ll combine the tibbles by row

org_lead_confs <- list_rbind(clean_dfs)
org_lead_confs |>
  slice_sample(n = 10) |>
  select(1:3, `Grand total`)

# A tibble: 10 × 4
   unitid `institution name`                         year `Grand total`
    <dbl> <chr>                                     <dbl>         <dbl>
 1 178615 Truman State University                    2021             6
 2 149514 Trinity International University-Illinois  2021            10
 3 179159 Saint Louis University                     2018            14
 4 480569 Florida Institute of Technology-Online     2022            20
 5 136330 Palm Beach Atlantic University             2020            32
 6 489937 Carolina University                        2022             8
 7 228787 The University of Texas at Dallas          2023            34
 8 127918 Regis University                           2019            75
 9 152336 University of Saint Francis-Fort Wayne     2021            22
10 162928 Johns Hopkins University                   2021             0
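
Since the original files are not distributed with these notes, here is a self-contained sketch of the same three steps (import, standardize names, combine by row) using two tiny made-up files written to a temporary directory. The file names and values are invented, and rename_with() is swapped in for the helper function purely to keep the example short.

```r
library(tidyverse)

# write two tiny, made-up files with year-specific column prefixes
tmp_dir <- tempfile()
dir.create(tmp_dir)
write_csv(tibble(unitid = c(1, 2), `C2018_A.Grand total` = c(5, 7)),
          file.path(tmp_dir, "toy_2018.csv"))
write_csv(tibble(unitid = c(1, 2), `C2019_A.Grand total` = c(6, 9)),
          file.path(tmp_dir, "toy_2019.csv"))

# step 1: import each file into its own tibble
paths <- list.files(path = tmp_dir, pattern = "\\.csv$", full.names = TRUE)
toy_dfs <- map(.x = paths,
               .f = \(x) read_csv(x, show_col_types = FALSE))

# step 2: strip the prefixes so the column names match across tibbles
toy_clean <- map(.x = toy_dfs,
                 .f = \(df) rename_with(df, \(nm) str_remove(nm, "^.+\\.")))

# step 3: combine by row
toy_combined <- list_rbind(toy_clean)
toy_combined
```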

The `walk()` functional

Functions in R tend to return some sort of output
Certain functions also have side effects: they do something other than return output, for example:
- change variables in the environment
- plot graphics
- save data to delimited text files
The walk() functional is used when all we care about are the side effects of a function
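
A minimal sketch of walk() used for a side effect, assuming we want to save one small tibble per year to a temporary directory. The file names and values are made up for illustration.

```r
library(tidyverse)

out_dir <- tempfile()
dir.create(out_dir)

years <- 2018:2019

# write one small, made-up tibble per year; only the files matter here
result <- walk(.x = years,
               .f = \(x) write_csv(tibble(year = x, total = x - 2000),
                                   file.path(out_dir, str_c("toy_", x, ".csv"))))

list.files(path = out_dir) # the side effect: one new file per year
identical(result, years)   # walk() hands back its input, not a list of outputs
```

Unlike map(), walk() does not collect the return values of .f. It returns its input invisibly, which also lets it sit in the middle of a pipeline.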