Following this lecture, students should be able to:
Describe how the map() functional (and its atomic variants) works
Use the map() functional (and its atomic variants) to accomplish iteration tasks
Distinguish between map() and walk()
The {tidyverse} metapackage includes a new-to-us package called {purrr}
Mutating functions take a vector and return a vector of the same length
Summarizing functions take a vector as input and return a single value as output
We can use the across() function (itself a functional, btw) to apply these functions to multiple columns “at once”
The map() functional
Big idea: the map() function
map() is the tidyverse’s Platonic, general-purpose functional. It takes a vector (atomic or list) and a function, and applies the function to each element of the vector, returning a list with the same number of elements.
map() in action!
x <- 1:10
l <- map(.x = x, # the input vector to apply the function to
         .f = \(x) if_else(x %% 2 == 0, "even", "odd")) # the function
length(l) == length(x) # same length as x
[1] TRUE
[[1]]
[1] "odd"
[[2]]
[1] "even"
[[3]]
[1] "odd"
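For comparison, here is a sketch of the kind of for() loop that the map() call above stands in for. The names out_loop and out_map are introduced here for illustration only, and base if/else is used in place of if_else() so that the sketch only needs {purrr}.

```r
library(purrr)

x <- 1:10

# for() loop version: pre-allocate an output list, then fill it element by element
out_loop <- vector(mode = "list", length = length(x))
for (i in seq_along(x)) {
  out_loop[[i]] <- if (x[[i]] %% 2 == 0) "even" else "odd"
}

# map() version: the allocation and the indexing are handled for us
out_map <- map(.x = x, .f = \(v) if (v %% 2 == 0) "even" else "odd")

# Both approaches build the same list
identical(out_loop, out_map)
```

The loop needs three pieces of bookkeeping (the output list, the index, and the assignment), while map() only asks for the input and the function.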
The map() functional replaces for() loops which iterate through the elements of a vector (or list) and populate some output list
The map() functional has a ... argument that we can use to pass additional arguments to the function defined by .f=
x <- c("10-15-24", "10-16-24", "10-17-24", "10-18-24", "10-19-24")
map(.x = x,
    .f = mdy,
    tz = "US/Eastern") # tz= is an argument within mdy()
[[1]]
[1] "2024-10-15 EDT"
[[2]]
[1] "2024-10-16 EDT"
[[3]]
[1] "2024-10-17 EDT"
[[4]]
[1] "2024-10-18 EDT"
[[5]]
[1] "2024-10-19 EDT"
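Passing tz= through ... is one of two equivalent spellings. The sketch below, using a shortened input vector and the illustrative names via_dots and via_lambda, writes the same call with an anonymous function so the extra argument sits right next to the function it belongs to.

```r
library(purrr)
library(lubridate)

dates <- c("10-15-24", "10-16-24")

# Extra argument handed to mdy() through map()'s ... argument
via_dots <- map(.x = dates, .f = mdy, tz = "US/Eastern")

# The same call with an anonymous function that sets tz= explicitly
via_lambda <- map(.x = dates, .f = \(d) mdy(d, tz = "US/Eastern"))

# Both produce the same list of date-times
identical(via_dots, via_lambda)
```

The anonymous-function form is a little longer, but it makes clear which function receives tz=.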
map() within mutate()
map() is effectively a mutating function
Vectorized function
# A tibble: 234 × 3
model displ year
<chr> <dbl> <fct>
1 a4 1.8 1999
2 a4 1.8 1999
3 a4 2 2008
4 a4 2 2008
5 a4 2.8 1999
6 a4 2.8 1999
7 a4 3.1 2008
8 a4 quattro 1.8 1999
9 a4 quattro 1.8 1999
10 a4 quattro 2 2008
# ℹ 224 more rows
map()
# A tibble: 234 × 3
model displ year
<chr> <dbl> <list>
1 a4 1.8 <fct [1]>
2 a4 1.8 <fct [1]>
3 a4 2 <fct [1]>
4 a4 2 <fct [1]>
5 a4 2.8 <fct [1]>
6 a4 2.8 <fct [1]>
7 a4 3.1 <fct [1]>
8 a4 quattro 1.8 <fct [1]>
9 a4 quattro 1.8 <fct [1]>
10 a4 quattro 2 <fct [1]>
# ℹ 224 more rows
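The code behind the two outputs above is not shown here, so this is one way they could be produced. It assumes the conversion function is as.factor(), which is a guess; the names vectorized and mapped are illustrative.

```r
library(tidyverse)

# Vectorized: as.factor() receives the whole year column at once
vectorized <- mpg |>
  select(model, displ, year) |>
  mutate(year = as.factor(year))

# map(): as.factor() is applied to one value at a time,
# so year becomes a list column of length-1 factors
mapped <- mpg |>
  select(model, displ, year) |>
  mutate(year = map(.x = year, .f = as.factor))

class(vectorized$year)
class(mapped$year)
```

Both versions keep all 234 rows; the difference is whether year is an atomic column or a list column.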
Big idea: list columns and hierarchy
When we introduced data frames and tibbles, we described them as lists of atomic vectors, each atomic vector being a column. These data structures also support list columns! This is our first taste of hierarchical datasets, a concept that we will get more practice with when we learn to import JSON data using web APIs.
The map() functional always returns a list
This behavior ensures that the output is always the same length as the input
sample_sizes <- c(2, 4, 6)
samples <- map(.x = sample_sizes,
.f = \(x) runif(x, min = 0, max = 1))
samples
[[1]]
[1] 0.931081766 0.002136413
[[2]]
[1] 0.9211738 0.5341741 0.3408999 0.1276429
[[3]]
[1] 0.69030450 0.82069657 0.78606399 0.69244715 0.03834361 0.85531494
We can use purrr::list_c() to concatenate these values into an atomic vector
# A tibble: 234 × 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <fct> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
3 audi a4 2 2008 4 manu… f 20 31 p comp…
4 audi a4 2 2008 4 auto… f 21 30 p comp…
5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
8 audi a4 q… 1.8 1999 4 manu… 4 18 26 p comp…
9 audi a4 q… 1.8 1999 4 auto… 4 16 25 p comp…
10 audi a4 q… 2 2008 4 manu… 4 20 28 p comp…
# ℹ 224 more rows
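A small sketch of what list_c() does on its own, using a made-up list (nested and flat are illustrative names). Note that the result is only the same length as the list when every element has length 1, which is the situation inside mutate().

```r
library(purrr)

# A list whose elements have different lengths
nested <- list(c(1, 2), c(3, 4, 5), 6)

# list_c() concatenates the elements into one atomic vector
flat <- list_c(nested)

flat
length(nested)  # number of list elements
length(flat)    # number of values after concatenation
```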
map() variants including map_lgl(), map_dbl(), map_int(), and map_chr() output corresponding atomic vectors
There is no map_fct(), but map_vec() returns an atomic vector whose type is the common type of the list elements (so a list of factors comes back as a factor)
# A tibble: 234 × 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <fct> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
3 audi a4 2 2008 4 manu… f 20 31 p comp…
4 audi a4 2 2008 4 auto… f 21 30 p comp…
5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
8 audi a4 q… 1.8 1999 4 manu… 4 18 26 p comp…
9 audi a4 q… 1.8 1999 4 auto… 4 16 25 p comp…
10 audi a4 q… 2 2008 4 manu… 4 20 28 p comp…
# ℹ 224 more rows
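The typed variants can be compared side by side on a small input. This is a sketch with illustrative names (parity, is_even, and so on); each call returns an atomic vector of the type named in the function.

```r
library(purrr)

x <- 1:5

# Helper used below: label a single number as "even" or "odd"
parity <- \(v) if (v %% 2 == 0) "even" else "odd"

is_even <- map_lgl(.x = x, .f = \(v) v %% 2 == 0)  # logical vector
doubled <- map_int(.x = x, .f = \(v) v * 2L)       # integer vector
halved  <- map_dbl(.x = x, .f = \(v) v / 2)        # double vector
labels  <- map_chr(.x = x, .f = parity)            # character vector

# map_vec() keeps richer types such as factors
as_fct <- map_vec(.x = x, .f = \(v) factor(parity(v), levels = c("even", "odd")))

class(as_fct)
```

Each variant expects the function to return a single value of the matching type for every element.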
The following exercises use the nycflights23::flights dataset.
Use map_dbl() to convert all values of distance from miles to feet. There are 5,280 feet in one mile.
Use map_chr() to make all the letters in tailnum lowercase using str_to_lower().
Why use map() at all?
Motivating question
In a language like R, where we get so much iteration “for free” (through vectorized functions), why would we need to use map() and its variants at all?
Parallel processing (via {furrr}; not discussed in this lecture)
Note: this example is somewhat contrived
To illustrate the utility of the progress bar, I’ve concocted a scenario that you will likely never find useful in your data science careers (unless you become an introductory statistics professor). Time-consuming operations like this are common, but relate to tasks that we haven’t gotten to yet (e.g., scraping columns of URLs, querying APIs based on columns of search parameters), or won’t ever cover in this class (e.g., sensitivity analysis in probabilistic modeling, bootstrapping, etc.)
Copy the following code into an R session with {tidyverse} loaded in, then run it.
tibble(size = 1:50000) |>
  mutate(sample_mean = map_dbl(
    .x = size,
    .f = \(x) mean(rnorm(x, mean = 0, sd = 1)),
    .progress = TRUE)) |>
  ggplot(aes(x = size,
             y = sample_mean)) +
  geom_line(linewidth = 0.1) +
  geom_hline(yintercept = 0,
             color = "gray",
             linetype = 2) +
  labs(title = "The law of large numbers",
       x = "Sample size",
       y = "Sample mean")
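The code above needs map_dbl() because mean(rnorm(size)) is not vectorized over size in the way we want. A small sketch with made-up sizes (sizes, collapsed, and per_size are illustrative names) shows the difference: when rnorm() is handed a vector for n, it uses the length of that vector as the number of draws.

```r
library(purrr)

sizes <- c(10, 100, 1000)

# Without map: rnorm() draws length(sizes) = 3 values in total,
# and mean() collapses them to a single number
collapsed <- mean(rnorm(sizes))

# With map_dbl(): one sample (and one mean) per requested size
per_size <- map_dbl(.x = sizes, .f = \(n) mean(rnorm(n)))

length(collapsed)
length(per_size)
```

Inside mutate(), the first version would fill the whole column with the same recycled number, while the second gives one sample mean per row.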
Motivating task
I have six .csv files that report data on organizational leadership graduate degree conferrals where each file corresponds to a different academic year (2018 to 2023). I’d like to import them all, and then combine them into a single dataset.
[1] "org_lead_2018.csv" "org_lead_2018.html" "org_lead_2019.csv"
[4] "org_lead_2019.html" "org_lead_2020.csv" "org_lead_2020.html"
[7] "org_lead_2021.csv" "org_lead_2021.html" "org_lead_2022.csv"
[10] "org_lead_2022.html" "org_lead_2023.csv" "org_lead_2023.html"
The read_csv() function is vectorized: it can accept a vector of file paths
It will automatically attempt to combine each file by row
Error: Files must have consistent column names:
* File 1 column 4 is: C2018_A.First or Second Major
* File 2 column 4 is: C2019_A.First or Second Major
We get an error because the column names do not match
Instead of using the vectorized read_csv(), let’s solve this problem with map()
dfs <- map(.x = str_c("./13_data/", files),
           .f = read_csv)
dfs[[1]] |> select(1:3) # for display purposes
# A tibble: 136 × 3
unitid `institution name` year
<dbl> <chr> <dbl>
1 100690 Amridge University 2018
2 102669 Alaska Pacific University 2018
3 107141 John Brown University 2018
4 110361 California Baptist University 2018
5 112075 Concordia University-Irvine 2018
6 119173 Mount Saint Mary's University 2018
7 119605 National University 2018
8 121150 Pepperdine University 2018
9 121309 Point Loma Nazarene University 2018
10 121691 University of Redlands 2018
# ℹ 126 more rows
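A self-contained sketch of the same import pattern, using two made-up files in a temporary folder in place of the org_lead files. The pattern= filter is an assumption about how the .html files in the earlier listing were left out, and the toy column names are hypothetical.

```r
library(tidyverse)

# Write two small files whose column names do not match (made-up data)
toy_dir <- tempfile("map_demo_")
dir.create(toy_dir)
write_csv(tibble(id = 1:2, C2018_total = c(5, 7)), file.path(toy_dir, "toy_2018.csv"))
write_csv(tibble(id = 3:4, C2019_total = c(6, 8)), file.path(toy_dir, "toy_2019.csv"))

# Keep only the .csv files and return full paths
toy_paths <- list.files(toy_dir, pattern = "\\.csv$", full.names = TRUE)

# Read each file into its own tibble; mismatched names are not a problem yet
toy_list <- map(.x = toy_paths, .f = \(p) read_csv(p, show_col_types = FALSE))

length(toy_list)
map(.x = toy_list, .f = names)
```

Because each file lands in its own list element, the column names can be repaired one tibble at a time before anything is combined.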
I can now iterate through each tibble in the dfs list to standardize the column names
First, I’ll make a function to help standardize names
Next, I’ll apply that function to each tibble in dfs and assign the output to a new list
org_lead_confs <- list_rbind(clean_dfs)
org_lead_confs |>
slice_sample(n = 10) |>
select(1:3, `Grand total`)
# A tibble: 10 × 4
unitid `institution name` year `Grand total`
<dbl> <chr> <dbl> <dbl>
1 178615 Truman State University 2021 6
2 149514 Trinity International University-Illinois 2021 10
3 179159 Saint Louis University 2018 14
4 480569 Florida Institute of Technology-Online 2022 20
5 136330 Palm Beach Atlantic University 2020 32
6 489937 Carolina University 2022 8
7 228787 The University of Texas at Dallas 2023 34
8 127918 Regis University 2019 75
9 152336 University of Saint Francis-Fort Wayne 2021 22
10 162928 Johns Hopkins University 2021 0
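A minimal sketch of the standardize-then-combine steps described above. It assumes the only mismatch is a year-specific prefix such as C2018_A. on some column names, as in the error message; the helper standardize_names() and the toy tibbles are hypothetical and stand in for the real function and files.

```r
library(tidyverse)

# Hypothetical helper: drop a year-specific prefix such as "C2018_A."
# so the same variable has the same name in every tibble
standardize_names <- function(df) {
  rename_with(df, \(nm) str_remove(nm, "^C\\d{4}_A\\."))
}

# Toy stand-ins for two of the imported tibbles (made-up values)
toy_dfs <- list(
  tibble(unitid = 1, year = 2018, `C2018_A.First or Second Major` = "First major"),
  tibble(unitid = 2, year = 2019, `C2019_A.First or Second Major` = "First major")
)

# Apply the helper to every tibble, then stack the results by row
toy_clean <- map(.x = toy_dfs, .f = standardize_names)
toy_combined <- list_rbind(toy_clean)

names(toy_combined)
```

Once the names agree, list_rbind() can stack the tibbles without the error that read_csv() raised.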
The walk() functional
Functions in R tend to return some sort of output
Certain functions also have side effects: they do something other than return output, for example:
change variables in the environment
plot graphics
save data to delimited text files
The walk() functional is used when all we care about are the side effects of a function
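As a sketch of the "save data to delimited text files" case, the code below writes each tibble in a small made-up list to its own .csv file in a temporary folder. The list toy_dfs and the walk_demo_ file names are illustrative assumptions.

```r
library(tidyverse)

# A small named list of tibbles (made-up data)
toy_dfs <- list(a = tibble(x = 1:2), b = tibble(x = 3:4))
out_dir <- tempdir()

# walk() calls write_csv() once per name; we only care that the files get written
walk(
  .x = names(toy_dfs),
  .f = \(nm) write_csv(toy_dfs[[nm]], file.path(out_dir, str_c("walk_demo_", nm, ".csv")))
)

# Check that both files now exist
written <- file.path(out_dir, str_c("walk_demo_", names(toy_dfs), ".csv"))
file.exists(written)
```

Using map() here would also write the files, but it would return a list of outputs that we have no use for.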
DSC 210 Data Wrangling