Following this lecture, students should be able to:
Describe how the map() functional (and its atomic variants) works
Use the map() functional (and its atomic variants) to accomplish iteration tasks
Distinguish between map() and walk()
The {tidyverse} metapackage includes a new-to-us package called {purrr}
Mutating functions take a vector and return a vector of the same length
Summarizing functions take a vector as input and return a single value as output
We can use the across() function (itself a functional, btw) to apply these functions to multiple columns “at once”
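As a quick refresher, here is a minimal sketch of across() applying one summarizing function and one mutating function to several columns. The tibble and its column names (scores, quiz_1, quiz_2) are made up for illustration.

```r
library(dplyr)

# Toy data; the column names are hypothetical
scores <- tibble(
  quiz_1 = c(8, 9, 10),
  quiz_2 = c(6, 7, 8)
)

# Summarizing function: each selected column collapses to a single value
quiz_means <- scores |>
  summarize(across(c(quiz_1, quiz_2), mean))

# Mutating function: each selected column keeps its length
quiz_pct <- scores |>
  mutate(across(c(quiz_1, quiz_2), \(x) x / 10))
```

quiz_means has one row, while quiz_pct has the same three rows as scores.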
The map() functional
Big idea: the map() function
map() is the tidyverse’s Platonic, general-purpose functional. It takes a vector (atomic or list) and a function, and applies the function to each element of the vector, returning a list with the same number of elements.
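To make "applies the function to each element, returning a list" concrete, the sketch below spells out the same work as a for() loop and compares it with map(). The input vector and the choice of sqrt() are arbitrary, picked only for illustration.

```r
library(purrr)

x <- c(1, 4, 9)

# for() loop version: pre-allocate an output list, then fill it element by element
out_loop <- vector("list", length(x))
for (i in seq_along(x)) {
  out_loop[[i]] <- sqrt(x[[i]])
}

# map() version: the same computation without the bookkeeping
out_map <- map(.x = x, .f = sqrt)

# Both approaches produce a list with one element per input element
same_result <- identical(out_loop, out_map)
```

The loop needs an index, a pre-allocated container, and an assignment; map() handles all three behind the scenes.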

map() in action!
x <- 1:10
l <- map(.x = x, # the input vector to apply the function to
         .f = \(x) if_else(x %% 2 == 0, "even", "odd")) # the function
length(l) == length(x) # same length as x
[1] TRUE
[[1]]
[1] "odd"
[[2]]
[1] "even"
[[3]]
[1] "odd"
The map() functional replaces for() loops that iterate through the elements of a vector (or list) and populate some output list
The map() functional has a ... argument that we can use to pass additional arguments to the function defined by .f=
x <- c("10-15-24", "10-16-24", "10-17-24", "10-18-24", "10-19-24")
map(.x = x,
    .f = mdy,
    tz = "US/Eastern") # tz= is an argument within mdy()
[[1]]
[1] "2024-10-15 EDT"
[[2]]
[1] "2024-10-16 EDT"
[[3]]
[1] "2024-10-17 EDT"
[[4]]
[1] "2024-10-18 EDT"
[[5]]
[1] "2024-10-19 EDT"
map() within mutate()
map() is effectively a mutating function
Vectorized function
# A tibble: 234 × 3
   model      displ year 
   <chr>      <dbl> <fct>
 1 a4           1.8 1999 
 2 a4           1.8 1999 
 3 a4           2   2008 
 4 a4           2   2008 
 5 a4           2.8 1999 
 6 a4           2.8 1999 
 7 a4           3.1 2008 
 8 a4 quattro   1.8 1999 
 9 a4 quattro   1.8 1999 
10 a4 quattro   2   2008 
# ℹ 224 more rows
map()
# A tibble: 234 × 3
   model      displ year     
   <chr>      <dbl> <list>   
 1 a4           1.8 <fct [1]>
 2 a4           1.8 <fct [1]>
 3 a4           2   <fct [1]>
 4 a4           2   <fct [1]>
 5 a4           2.8 <fct [1]>
 6 a4           2.8 <fct [1]>
 7 a4           3.1 <fct [1]>
 8 a4 quattro   1.8 <fct [1]>
 9 a4 quattro   1.8 <fct [1]>
10 a4 quattro   2   <fct [1]>
# ℹ 224 more rows
Big idea: list columns and hierarchy
When we introduced data frames and tibbles, we described them as lists of atomic vectors, each atomic vector being a column. These data structures also support list columns! This is our first taste of hierarchical datasets, a concept that we will get more practice with when we learn to import JSON data using web APIs.
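Here is a small sketch of how a list column arises when map() is used inside mutate(). The tibble (words) and its columns are hypothetical, chosen so that each list element has a different length.

```r
library(dplyr)
library(purrr)

# Toy data; `word` is a made-up column for illustration
words <- tibble(word = c("map", "walk", "purrr"))

words_split <- words |>
  mutate(
    n_chars = nchar(word),                        # vectorized function: atomic column
    chars   = map(word, \(w) strsplit(w, "")[[1]]) # map(): list column
  )

# Each element of the list column is its own character vector
second_word_chars <- words_split$chars[[2]]
```

The chars column holds vectors of length 3, 4, and 5, which an atomic column could not store. That nesting is what makes the tibble hierarchical.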
The map() functional always returns a list
This behavior ensures that the output is always the same length as the input
sample_sizes <- c(2, 4, 6)
samples <- map(.x = sample_sizes,
               .f = \(x) runif(x, min = 0, max = 1))
samples
[[1]]
[1] 0.931081766 0.002136413
[[2]]
[1] 0.9211738 0.5341741 0.3408999 0.1276429
[[3]]
[1] 0.69030450 0.82069657 0.78606399 0.69244715 0.03834361 0.85531494
Use purrr::list_c() to concatenate these values into an atomic vector
# A tibble: 234 × 11
   manufacturer model displ year    cyl trans drv     cty   hwy fl    class
   <chr>        <chr> <dbl> <fct> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4      1.8 1999      4 auto… f        18    29 p     comp…
 2 audi         a4      1.8 1999      4 manu… f        21    29 p     comp…
 3 audi         a4      2   2008      4 manu… f        20    31 p     comp…
 4 audi         a4      2   2008      4 auto… f        21    30 p     comp…
 5 audi         a4      2.8 1999      6 auto… f        16    26 p     comp…
 6 audi         a4      2.8 1999      6 manu… f        18    26 p     comp…
 7 audi         a4      3.1 2008      6 auto… f        18    27 p     comp…
 8 audi         a4 q…   1.8 1999      4 manu… 4        18    26 p     comp…
 9 audi         a4 q…   1.8 1999      4 auto… 4        16    25 p     comp…
10 audi         a4 q…   2   2008      4 manu… 4        20    28 p     comp…
# ℹ 224 more rows
The map() variants, including map_lgl(), map_dbl(), map_int(), and map_chr(), output the corresponding atomic vectors
There is no map_fct(), but map_vec() returns an atomic vector based on the common type of the list elements, so it also works with factors and dates
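The sketch below puts the variants side by side on a toy input. The values and lambda functions are arbitrary, and map_vec() and list_c() assume {purrr} version 1.0.0 or later.

```r
library(purrr)

x <- list(1.5, 2.5, 3.5)

as_list <- map(x, \(v) v * 2)             # list of three doubles
as_dbl  <- map_dbl(x, \(v) v * 2)         # double vector
as_chr  <- map_chr(x, \(v) paste0(v, " L")) # character vector
as_lgl  <- map_lgl(x, \(v) v > 2)         # logical vector

# map_vec() keeps richer types such as factors and dates
years <- map_vec(c(1999, 2008), \(y) factor(y, levels = c(1999, 2008)))
dates <- map_vec(c("2024-10-15", "2024-10-16"), as.Date)

# list_c() concatenates the elements of an existing list into one atomic vector
flat <- list_c(list(1:2, 3:4))
```

Each atomic variant throws an error if the function returns something that cannot be stored in the requested type, which makes them a useful check on your assumptions.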
# A tibble: 234 × 11
   manufacturer model displ year    cyl trans drv     cty   hwy fl    class
   <chr>        <chr> <dbl> <fct> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4      1.8 1999      4 auto… f        18    29 p     comp…
 2 audi         a4      1.8 1999      4 manu… f        21    29 p     comp…
 3 audi         a4      2   2008      4 manu… f        20    31 p     comp…
 4 audi         a4      2   2008      4 auto… f        21    30 p     comp…
 5 audi         a4      2.8 1999      6 auto… f        16    26 p     comp…
 6 audi         a4      2.8 1999      6 manu… f        18    26 p     comp…
 7 audi         a4      3.1 2008      6 auto… f        18    27 p     comp…
 8 audi         a4 q…   1.8 1999      4 manu… 4        18    26 p     comp…
 9 audi         a4 q…   1.8 1999      4 auto… 4        16    25 p     comp…
10 audi         a4 q…   2   2008      4 manu… 4        20    28 p     comp…
# ℹ 224 more rows
The following exercises use the nycflights23::flights dataset.
Use map_dbl() to convert all values of distance from miles to feet. There are 5,280 feet in one mile.
Use map_chr() to make all the letters in tailnum lowercase using str_to_lower().
Why use map() at all?
Motivating question
In a language like R, where we get so much iteration “for free” (through vectorized functions), why would we need to use map() and its variants at all?
{furrr}; not discussed in this lecture)
Note: this example is somewhat contrived
To illustrate the utility of the progress bar, I’ve concocted a scenario that you will likely never find useful in your data science careers (unless you become an introductory statistics professor). Time-consuming operations like this are common, but relate to tasks that we haven’t gotten to yet (e.g., scraping columns of URLs, querying APIs based on columns of search parameters), or won’t ever cover in this class (e.g., sensitivity analysis in probabilistic modeling, bootstrapping, etc.)
{tidyverse} loaded in, then run it.
tibble(size = 1:50000) |>
  mutate(sample_mean = map_dbl(
    .x = size,
    .f = \(x) mean(rnorm(x, mean = 0, sd = 1)),
    .progress = TRUE)) |>
  ggplot(aes(x = size,
             y = sample_mean)) +
  geom_line(linewidth = 0.1) +
  geom_hline(yintercept = 0,
             color = "gray",
             linetype = 2) +
  labs(title = "The law of large numbers",
       x = "Sample size",
       y = "Sample mean")
Motivating task
I have six .csv files that report data on organizational leadership graduate degree conferrals where each file corresponds to a different academic year (2018 to 2023). I’d like to import them all, and then combine them into a single dataset.
 [1] "org_lead_2018.csv"  "org_lead_2018.html" "org_lead_2019.csv" 
 [4] "org_lead_2019.html" "org_lead_2020.csv"  "org_lead_2020.html"
 [7] "org_lead_2021.csv"  "org_lead_2021.html" "org_lead_2022.csv" 
[10] "org_lead_2022.html" "org_lead_2023.csv"  "org_lead_2023.html"
The read_csv() function is vectorized: it can accept a vector of file paths
It will automatically attempt to combine each file by row
Error: Files must have consistent column names:
* File 1 column 4 is: C2018_A.First or Second Major
* File 2 column 4 is: C2019_A.First or Second Major
We get an error because the column names do not match
Instead of using the vectorized read_csv(), let’s solve this problem with map()
dfs <- map(.x = str_c("./13_data/", files),
           .f = read_csv)
dfs[[1]] |> select(1:3) # for display purposes
# A tibble: 136 × 3
   unitid `institution name`              year
    <dbl> <chr>                          <dbl>
 1 100690 Amridge University              2018
 2 102669 Alaska Pacific University       2018
 3 107141 John Brown University           2018
 4 110361 California Baptist University   2018
 5 112075 Concordia University-Irvine     2018
 6 119173 Mount Saint Mary's University   2018
 7 119605 National University             2018
 8 121150 Pepperdine University           2018
 9 121309 Point Loma Nazarene University  2018
10 121691 University of Redlands          2018
# ℹ 126 more rows
I can now iterate through each tibble in the dfs list to standardize the column names
First, I’ll make a function to help standardize names
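The helper itself is not shown in this text, so what follows is only a sketch of what such a function might look like. It assumes the mismatch comes from a year-specific prefix like "C2018_A." (as the error message suggests). The names standardize_names, toy_dfs, toy_clean, and toy_combined are hypothetical, and the toy tibbles stand in for the imported files.

```r
library(dplyr)
library(purrr)
library(stringr)

# Hypothetical helper: drop a year-specific prefix such as "C2018_A."
standardize_names <- function(df) {
  rename_with(df, \(nm) str_remove(nm, "^C\\d{4}_A\\."))
}

# Toy stand-ins for two of the imported files (assumed column names)
toy_dfs <- list(
  tibble(unitid = 1, `C2018_A.Grand total` = 10),
  tibble(unitid = 2, `C2019_A.Grand total` = 20)
)

# Apply the helper to every tibble, then stack the results by row
toy_clean <- map(.x = toy_dfs, .f = standardize_names)
toy_combined <- list_rbind(toy_clean)
```

Once every tibble shares the same column names, list_rbind() can stack them without the consistency error that read_csv() raised.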
Then, I’ll apply this function to each tibble in dfs and assign the output to a new list
org_lead_confs <- list_rbind(clean_dfs)
org_lead_confs |>
  slice_sample(n = 10) |>
  select(1:3, `Grand total`)
# A tibble: 10 × 4
   unitid `institution name`                         year `Grand total`
    <dbl> <chr>                                     <dbl>         <dbl>
 1 178615 Truman State University                    2021             6
 2 149514 Trinity International University-Illinois  2021            10
 3 179159 Saint Louis University                     2018            14
 4 480569 Florida Institute of Technology-Online     2022            20
 5 136330 Palm Beach Atlantic University             2020            32
 6 489937 Carolina University                        2022             8
 7 228787 The University of Texas at Dallas          2023            34
 8 127918 Regis University                           2019            75
 9 152336 University of Saint Francis-Fort Wayne     2021            22
10 162928 Johns Hopkins University                   2021             0
The walk() functional
Functions in R tend to return some sort of output
Certain functions also have side effects: they do something other than return output, for example:
change variables in the environment
plot graphics
save data to delimited text files
The walk() functional is used when all we care about are the side effects of a function
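As a minimal sketch of walk() in use, the code below saves each data frame in a list to its own .csv file. The data frames, the list names, and the use of a temporary directory are all made up for illustration.

```r
library(purrr)

# Toy data; the list names double as file names (both hypothetical)
dfs_to_save <- list(a = data.frame(x = 1:2), b = data.frame(x = 3:4))
out_dir <- tempdir()
paths <- file.path(out_dir, paste0(names(dfs_to_save), ".csv"))

# walk() calls the function once per name, purely for the side effect of writing a file
walk(.x = names(dfs_to_save),
     .f = \(nm) write.csv(dfs_to_save[[nm]],
                          file.path(out_dir, paste0(nm, ".csv")),
                          row.names = FALSE))

files_written <- file.exists(paths)
```

Unlike map(), walk() does not collect the function's results. It returns its input invisibly, so nothing is printed to the console.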
DSC 210 Data Wrangling