At the end of this lecture, students should be able to:
mutate() and summarize()
All functions are composed of three fundamental pieces:
The name (raise)
The argument(s) (objects base and exp)
The body (base^exp)
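As a minimal sketch of how these three pieces fit together (the exact definition of raise() below is an assumption, pieced together from the name, arguments, and body listed above):

```r
# name: raise; arguments: base and exp; body: base^exp
raise <- function(base, exp) {
  base^exp
}

raise(base = 2, exp = 3) # 2 raised to the 3rd power, i.e. 8
```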
Important: Sticking to the plan
It is very hard to troubleshoot code when it lives inside of a function(){} constructor. Always write working code first, then try to put it in a function constructor. Don’t forget to test!
Motivating utility
I’d like to create a function that calculates the proportion of missing values in a vector.
Step 1: Write some code that works
Step 2: Identify inputs and outputs
Input: the vector
Output: the proportion of missing values (a single value)
Step 4: Generalize the code in the body
Step 5: Test your function thoroughly
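The steps above might look something like the following sketch. The name prop_miss() is the one used for this function later in the lecture; the example vector and the exact body are assumptions, and there are other reasonable ways to write it.

```r
# Step 1: code that works on one concrete example vector
example_vec <- c(1, NA, 3, NA, 5)
sum(is.na(example_vec)) / length(example_vec)

# Steps 2 and 4: one input (the vector x), one output (a single
# proportion), with the body generalized to use x
prop_miss <- function(x) {
  sum(is.na(x)) / length(x)
}

# Step 5: test on inputs where the answer is known in advance
prop_miss(c(1, NA, 3, NA, 5)) # should be 0.4
prop_miss(c(1, 2, 3))         # should be 0
prop_miss(c(NA, NA))          # should be 1
```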
When building functions, we often spend a lot of time creating “example” data
R contains a robust set of vector generation tools that we can take advantage of
Creating integer sequences
Creating double sequences
Creating repeating vectors
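A few base R calls that cover these three cases are sketched below. The specific values are made up for illustration, and these are not necessarily the same calls shown in the lecture.

```r
# integer sequences with the colon operator
int_up <- 1:10
int_down <- 10:1

# double sequences with seq()
dbl_by <- seq(from = 0, to = 1, by = 0.25)        # 0.00 0.25 0.50 0.75 1.00
dbl_len <- seq(from = 0, to = 10, length.out = 5) # 0.0 2.5 5.0 7.5 10.0

# repeating vectors with rep()
rep_times <- rep(c("a", "b"), times = 3) # "a" "b" "a" "b" "a" "b"
rep_each <- rep(c("a", "b"), each = 3)   # "a" "a" "a" "b" "b" "b"
```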
Tip: Manipulating vector order
You can use the rev() function to reverse the order of the elements of a vector, and the sort() function to order the elements from smallest to largest (by default decreasing = FALSE).
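A quick illustration of both functions, using a made-up vector:

```r
x <- c(3, 1, 2)

rev(x)                     # 2 1 3 (original order, reversed)
sort(x)                    # 1 2 3 (smallest to largest)
sort(x, decreasing = TRUE) # 3 2 1 (largest to smallest)
```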
Random sampling from a vector
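One way to do this in base R is with sample(). The sketch below uses an arbitrary seed and made-up vectors; because the draws are random, no particular output is shown.

```r
set.seed(210) # arbitrary seed, only so the draws are reproducible

x <- 1:10

# draw 5 elements without replacement (the default)
draws_without <- sample(x, size = 5)

# draw 20 elements with replacement
draws_with <- sample(x, size = 20, replace = TRUE)

# draw with unequal probabilities
coin <- sample(c("H", "T"), size = 10, replace = TRUE, prob = c(0.7, 0.3))
```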
Create a function called circumference() that takes a radius (r) value and returns the corresponding circumference (C) according to the equation \(C = 2 \pi r\). R has a built-in object called pi that stores \(\pi\) as a double (printed to 6 decimal places by default).
Create a function called se() that calculates the standard error (SE) of a numeric vector according to the equation \(SE(x) = \frac{SD(x)}{\sqrt{n}}\), where n is the length of x.
Create a function called both_na() that takes two vectors of the same length and returns the number of positions that have an NA in both vectors.
# exercise 1
## defining the function
circumference <- function(r) {
  2 * pi * r
}
## testing the function
circumference(r = 0) # should be 0
circumference(r = NA) # should be NA
circumference(r = 1) # should be 2pi
# exercise 2
## defining the function
se <- function(x) {
  sd(x) / sqrt(length(x))
}
## testing the function
se(x = rep(1, 10)) # should be 0
se(x = c(1, 2, 3, 4, NA)) # should be NA
se(x = rnorm(1000, 0, 10)) # should be close to 10 / sqrt(1000), about 0.32
# exercise 3
## defining the function
both_na <- function(x, y) {
  sum(is.na(x) & is.na(y))
}
## testing the function
both_na(x = c(1, 2, 3), y = c(1, 2, 3)) # should be 0
both_na(x = c(NA, 2, NA), y = c(NA, 2, 3)) # should be 1
both_na(x = c(NA, NA, NA), y = c(NA, NA, NA)) # should be 3
By default, functions like mean() don’t remove NA values from the input vector prior to computation
Let’s “wrap” the mean() function in our own function meanNA()
Tip: dot-dot-dot
The ... argument in the user-defined wrapper accepts any number of arguments, and then sends those arguments to another function within the body of the wrapper.
We use the ... argument so that meanNA() can accept any of the arguments that you’d usually pass to mean() without having to name those arguments explicitly when creating the wrapper
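A minimal sketch of such a wrapper is below. The body is an assumption based on the description above (everything in ... is forwarded to mean(), with na.rm fixed to TRUE); the lecture's own definition may differ.

```r
# wrapper around mean() that always removes NA values
meanNA <- function(...) {
  mean(..., na.rm = TRUE)
}

meanNA(c(1, 2, NA, 3))                   # 2, the NA is dropped first
meanNA(c(1, 2, NA, 3, 100), trim = 0.25) # trim= is passed through to mean()
```

One consequence of hard-coding na.rm = TRUE in this sketch: calling meanNA(x, na.rm = FALSE) fails, because na.rm would then be supplied to mean() twice.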
This is fun! Let’s make another wrapper function!
The stringr::str_c() function is used to combine the elements of character vectors
chr_a <- c("Hey, ", "Hi, ", "Hello, ")
chr_b <- c("Jaewan", "Liana", "Rebecca")
chr_c <- c("!", "!", "?")
str_c(chr_a, chr_b, chr_c)
[1] "Hey, Jaewan!" "Hi, Liana!" "Hello, Rebecca?"
Let’s wrap str_c() in a function str_smoosh() with a fixed collapse= that’ll just take all the strings and smoosh ’em together into a single string
Note: str_flatten()
The functionality of str_smoosh() already exists in the {stringr} package as str_flatten(). I think str_smoosh() is a better name though.
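A minimal sketch of what str_smoosh() could look like, assuming it forwards everything to str_c() and fixes collapse = "" (the lecture's own definition isn't shown here, so treat the body as a guess):

```r
library(stringr)

# wrapper around str_c() that always collapses to a single string
str_smoosh <- function(...) {
  str_c(..., collapse = "")
}

chr_a <- c("Hey, ", "Hi, ", "Hello, ")
chr_b <- c("Jaewan", "Liana", "Rebecca")
chr_c <- c("!", "!", "?")

str_smoosh(chr_a, chr_b, chr_c)
# "Hey, Jaewan!Hi, Liana!Hello, Rebecca?"
```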
log() takes a numeric vector and returns the natural logarithmic transformation of that vector (i.e., calculating the natural log for each element of the vector). Create a wrapper function called log3() that returns the log base 3 transformation of a given numeric vector. The log() function that you’re wrapping only has two arguments anyway, so you don’t need to use a ... argument.
By default, summarize() removes the last grouping level. Create a wrapper function around summarize() that does not remove any grouping levels. Take a peek at the .groups= argument.
# exercise 4
## defining the function
log3 <- function(x) {
  log(x, base = 3)
}
## testing the function out
log3(c(1, 3, 9, 27, 81)) # should give 0, 1, 2, 3, 4
# exercise 5
## defining the function
summarize_keep <- function(...) {
  summarize(..., .groups = "keep")
}
## testing the function out
### should yield a tibble with 6 groups (and does!)
penguins |>
  drop_na(sex) |>
  group_by(species, sex) |>
  summarize_keep(avg_body_mass = meanNA(body_mass_g))
Let’s add some functionality to R’s built-in length() function
I want to have the option to remove NA values before calculating the length of a vector
# defining the function
lengthNA <- function(x, na.rm = FALSE) { # FALSE is the default
  if (na.rm == FALSE) { # if the na.rm object is FALSE
    length(x) # then simply calculate the length
  } else { # otherwise
    length(x[!is.na(x)]) # remove NAs then calculate length
  }
}
# testing the function out
lengthNA(1:5) # should return 5
[1] 5
[1] 4
Note: logical indexing
In the user-defined function above, is.na() returns a logical vector that is TRUE where an element of x is NA and FALSE otherwise. I invert this logical vector by calling ! on it (TRUE becomes FALSE and vice versa). I then use the resulting logical vector to index x. Elements in x that are NA will be removed.
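Here is the same logic taken one step at a time on a small made-up vector:

```r
x <- c(4, NA, 7, NA, 1)

is.na(x)     # FALSE  TRUE FALSE  TRUE FALSE
!is.na(x)    #  TRUE FALSE  TRUE FALSE  TRUE
x[!is.na(x)] # 4 7 1 (only the positions marked TRUE are kept)

length(x)            # 5
length(x[!is.na(x)]) # 3
```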
{dplyr} provides us with the if_else() function, which can make our code a bit cleaner
Important: if_else() requirements
The if_else() function can only be used when the outputs of true= and false= are either length 1 or the same length as the logical vector passed to condition=.
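As a sketch of what the cleaner version could look like, here is lengthNA() rewritten with if_else(). The name lengthNA2() is hypothetical, chosen only to avoid overwriting the base R version above. This works because condition=, true=, and false= are all length 1 here.

```r
library(dplyr)

# hypothetical if_else() version of lengthNA()
lengthNA2 <- function(x, na.rm = FALSE) {
  if_else(
    condition = na.rm,
    true = length(x[!is.na(x)]), # NAs removed before counting
    false = length(x)            # plain length
  )
}

lengthNA2(c(1, NA, 3))               # 3
lengthNA2(c(1, NA, 3), na.rm = TRUE) # 2
```

Unlike if/else, if_else() evaluates both true= and false= before choosing between them, which is harmless here but worth keeping in mind when either branch is expensive.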
Create a wrapper function around rnorm() that gives the option (using a round= argument) to round the sampled values to two decimal places. The default value of the round= argument is FALSE.
# using base R conditional statements
rnorm_round <- function(..., round = FALSE) {
  if (round == FALSE) {
    rnorm(...)
  } else {
    round(rnorm(...), digits = 2)
  }
}
# testing out the function
rnorm_round(n = 10, mean = 5, sd = 1) # a bunch of long doubles
rnorm_round(n = 10, mean = 5, sd = 1, round = TRUE) # two decimals
If a function takes a vector and returns a vector of the same length (e.g., the log3() function you created) it is “mutating”
If a function takes a vector and returns a single value (e.g., the prop_miss() function we made together) it is “summarizing”
Unsurprisingly, mutating functions work well within mutate(), and summarizing functions work well in summarize()
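To make the distinction concrete, here is a small sketch using a made-up tibble (toy and its mass column are hypothetical stand-ins for real data). log3() is defined as in the exercise above, and the body of prop_miss() is an assumed one-line version.

```r
library(dplyr)

log3 <- function(x) log(x, base = 3)               # mutating: n values in, n values out
prop_miss <- function(x) sum(is.na(x)) / length(x) # summarizing: n values in, 1 value out

toy <- tibble(mass = c(3, 9, NA, 27))

# mutating function inside mutate(): one new value per row
mutated <- toy |>
  mutate(log3_mass = log3(mass))

# summarizing function inside summarize(): one row per group
summarized <- toy |>
  summarize(prop_missing = prop_miss(mass))
```

Here mutated keeps all four rows and gains a log3_mass column, while summarized collapses to a single row with prop_missing equal to 0.25.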
For each exercise, review the code and (1) describe the utility of the function, (2) classify the function as mutating, summarizing, or something else, and (3) if mutating/summarizing, apply the function to a variable or variables in palmerpenguins::penguins. Assume that arguments x and y are vectors.
y to x relative to y. This is a mutating function.
x. This is a summarizing function.
x containing the ascending and then descending values of x. This function is neither mutating nor summarizing.
DSC 210 Data Wrangling