User-defined Functions

Sam Mason

Learning goals

At the end of this lecture, students should be able to:

Write simple functions in R
Wrap existing functions to be more useful
Apply conditional execution within functions
Apply user-defined functions within mutate() and summarize()

Packages

library(tidyverse)
library(palmerpenguins)

A simple function

Let’s write a simple function to get our bearings

raise <- function(base, exp) {
  base^exp
}

All functions are composed of three fundamental pieces
- The name (raise)
- The argument(s) (objects base and exp)
- The body (base^exp)
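
To see the three pieces in action, here is a small sketch that calls raise() a few different ways. It assumes only the definition shown above; the specific input values are arbitrary choices for illustration.

```r
# the function from above: name (raise), arguments (base, exp), body (base^exp)
raise <- function(base, exp) {
  base^exp
}

# arguments matched by position: base = 2, exp = 3
raise(2, 3) # returns 8

# arguments matched by name, so their order no longer matters
raise(exp = 3, base = 2) # returns 8

# the ^ operator is vectorized, so the function is too
raise(base = 1:4, exp = 2) # returns 1 4 9 16
```

Naming arguments in the call is usually the safer habit, since it protects against accidentally swapping base and exp.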

Steps to writing a function

Write some code that works
Identify inputs and outputs
Set inputs to arguments
Generalize your code in the function body
Test your function thoroughly

Important: Sticking to the plan

It is very hard to troubleshoot code when it lives inside of a function(){} constructor. Always write working code first, then move it into a function constructor. Don’t forget to test!

A more complex function

Motivating utility

I’d like to create a function that calculates the proportion of missing values in a vector.

Step 1: Write some code that works

# creating an example vector to work with
vec <- c(1, 2, NA, 4, NA, 6, 7, 8, NA, 10)

# calculating the number of NA values
num_na <- sum(is.na(vec))

# calculating the proportion of NA values
num_na / length(vec)

[1] 0.3

Step 2: Identify inputs and outputs

Input: the vector
Output: the proportion of missing values (a single value)

Step 3: Set inputs to arguments

prop_miss <- function(x) {

}

Step 4: Generalize the code in the body

prop_miss <- function(x) {
  num_na <- sum(is.na(x))
  num_na / length(x)
}

Step 5: Test your function thoroughly
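
One way to approach this step is to call the function on inputs whose answers you already know. The checks below are a minimal sketch; the test vectors are illustrative choices rather than a complete test suite.

```r
# the function from Step 4
prop_miss <- function(x) {
  num_na <- sum(is.na(x))
  num_na / length(x)
}

# the original example vector: 3 of 10 values are NA
prop_miss(c(1, 2, NA, 4, NA, 6, 7, 8, NA, 10)) # should be 0.3

# no missing values at all
prop_miss(1:10) # should be 0

# every value missing
prop_miss(c(NA, NA, NA)) # should be 1

# a character vector, to confirm the function is not limited to numbers
prop_miss(c("a", NA)) # should be 0.5
```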

Generating vectors

When building functions, we often spend a lot of time creating “example” data
R contains a robust set of vector generation tools that we can take advantage of

Creating integer sequences

# create an integer vector with elements 1 through 10
1:10

 [1]  1  2  3  4  5  6  7  8  9 10

# create an integer vector with elements -2 through -7
-2:-7

[1] -2 -3 -4 -5 -6 -7

Creating double sequences

# create a double vector from 2 to 20 incremented by 2
seq(from = 2, to = 20, by = 2)

 [1]  2  4  6  8 10 12 14 16 18 20

# create a double vector from -3.5 to 1.5 by 0.5
seq(from = -3.5, to = 1.5, by = 0.5)

 [1] -3.5 -3.0 -2.5 -2.0 -1.5 -1.0 -0.5  0.0  0.5  1.0  1.5

Creating repeating vectors

# create a vector containing five zeros
rep(x = 0, times = 5)

[1] 0 0 0 0 0

# create a vector containing "Pop!" twice
rep(x = "Pop!", times = 2)

[1] "Pop!" "Pop!"

Tip: Manipulating vector order

You can use the rev() function to reverse the order of the elements of a vector, and the sort() function to order the elements from smallest to largest (by default decreasing = FALSE).
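
A short sketch of both functions, using a made-up vector called v:

```r
# an unordered example vector (values chosen arbitrarily)
v <- c(3, 10, 1, 7)

# reverse the current order of the elements
rev(v) # returns 7 1 10 3

# order from smallest to largest (the default)
sort(v) # returns 1 3 7 10

# order from largest to smallest
sort(v, decreasing = TRUE) # returns 10 7 3 1

# reversing a sorted vector gives the same result
rev(sort(v)) # returns 10 7 3 1
```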

Random sampling from a vector

sample(x = state.name, # from a vector of state names (built-in)
       size = 5, # randomly grab 5 strings
       replace = FALSE) # and don't "put them back" once sampled

[1] "Vermont"       "Indiana"       "Massachusetts" "Delaware"     
[5] "Washington"

Random sampling from a normal distribution

x <- rnorm(n = 10, # sample 10 values from a normal distribution with...
           mean = 2, # ...a mean of 2 and...
           sd = 1) # ...a standard deviation of 1
round(x, digits = 2)

 [1] 0.96 1.13 0.84 2.03 1.77 1.57 2.15 1.83 2.24 0.57
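
These generation tools can be combined to build test inputs with a known answer. The sketch below assumes the prop_miss() function defined earlier; the object names test_vec and shuffled are made up for this example.

```r
# the function defined earlier in the lecture
prop_miss <- function(x) {
  num_na <- sum(is.na(x))
  num_na / length(x)
}

# seven observed values followed by three missing values
test_vec <- c(1:7, rep(x = NA, times = 3))

# shuffle the positions without replacement
shuffled <- sample(x = test_vec, size = length(test_vec), replace = FALSE)

# the proportion missing should not depend on the order
prop_miss(test_vec) # should be 0.3
prop_miss(shuffled) # should be 0.3
```

Because the number of NA values is fixed by construction, the expected answer is known even though the shuffled order is random.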

Create a function called circumference() that takes a radius (r) value and returns the corresponding circumference (C) according to the equation \(C = 2*\pi*r\). R has a built-in object called pi that stores \(\pi\) at double precision (it prints as 3.141593 by default).
Create a function called se() that calculates the standard error (SE) of a numeric vector according to the equation \(SE(x) = \frac{SD(x)}{\sqrt{n}}\)
Challenge: Create a function called both_na() that takes two vectors of the same length and returns the number of positions that have an NA in both vectors.

# exercise 1
## defining the function
circumference <- function(r) {
  2 * pi * r
}

## testing the function
circumference(r = 0) # should be 0
circumference(r = NA) # should be NA
circumference(r = 1) # should be 2pi

# exercise 2
## defining the function
se <- function(x) {
  sd(x) / sqrt(length(x))
}

## testing the function
se(x = rep(1, 10)) # should be 0
se(x = c(1, 2, 3, 4, NA)) # should be NA
se(x = rnorm(1000, 0, 10)) # should give some value close to 0.32 (10 / sqrt(1000))

# exercise 3
## defining the function
both_na <- function(x, y) {
  sum(is.na(x) & is.na(y))
}

## testing the function
both_na(x = c(1, 2, 3), y = c(1, 2, 3)) # should be 0
both_na(x = c(NA, 2, NA), y = c(NA, 2, 3)) # should be 1
both_na(x = c(NA, NA, NA), y = c(NA, NA, NA)) # should be 3

The execution environment

When we create a new object in R, it lives in what’s called the “global environment”

Each time a function runs it creates an “execution environment” and fills it with the objects created by argument assignments

# creating a function that prints out the objs. in the exec. env.
exec_env_ls <- function(x, y, z) {
  ls()
}

# running the function with arguments
exec_env_ls(x = 1, y = 2, z = 3)

[1] "x" "y" "z"
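
Objects created inside the body also live in the execution environment, not the global environment. The sketch below illustrates this with a hypothetical function make_inside() and a hypothetical object inside_obj; it assumes no object named inside_obj already exists in your session.

```r
# a function that creates an object in its body, then lists what it can see
make_inside <- function() {
  inside_obj <- "I only exist while the function runs"
  ls()
}

# the execution environment contains the object created in the body
make_inside() # returns "inside_obj"

# after the function finishes, the object is not available globally
exists("inside_obj") # returns FALSE (assuming it was never created elsewhere)
```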

Wrapper functions

By default, functions like mean() don’t remove NA values from the input vector prior to computation
Let’s “wrap” the mean() function in our own function meanNA()

# defining the function
meanNA <- function(...) {
  mean(..., na.rm = TRUE)
}

# testing it out
meanNA(x = c(1, 2, NA, 0, 5)) # should be 2

[1] 2

meanNA(x = rep(0, 5)) # should be 0

[1] 0

meanNA(x = rep(1, 5)) # should be 1

[1] 1

meanNA(x = c(1, 2, 2, 0, 5)) # should be 2

[1] 2

Tip: dot-dot-dot

The ... argument in the user-defined wrapper accepts any number of arguments, and then sends those arguments to another function within the body of the wrapper.

We use the ... argument so that meanNA() can accept any of the arguments that you’d usually pass to mean() without having to name those arguments explicitly when creating the wrapper
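
As a sketch of this pass-through behavior, the example below hands mean()'s trim= argument to meanNA() even though the wrapper never names it. The input vector is made up for illustration.

```r
# the wrapper defined above
meanNA <- function(...) {
  mean(..., na.rm = TRUE)
}

# an example vector with one NA and one extreme value
x <- c(1, 2, 3, 100, NA)

# the NA is dropped, leaving (1 + 2 + 3 + 100) / 4
meanNA(x) # returns 26.5

# trim = 0.25 is forwarded to mean() through ...
# with 4 non-missing values, the smallest and largest are trimmed
meanNA(x, trim = 0.25) # returns 2.5
```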
This is fun! Let’s make another wrapper function!
The stringr::str_c() function is used to combine the elements of character vectors

chr_a <- c("Hey, ", "Hi, ", "Hello, ")
chr_b <- c("Jaewan", "Liana", "Rebecca")
chr_c <- c("!", "!", "?")
str_c(chr_a, chr_b, chr_c)

[1] "Hey, Jaewan!"    "Hi, Liana!"      "Hello, Rebecca?"

It also has this handy argument collapse= that’ll just take all the strings and smoosh ’em together into a single string

str_c(chr_a, chr_b, chr_c, collapse = " ")

[1] "Hey, Jaewan! Hi, Liana! Hello, Rebecca?"

Let’s wrap this functionality up in a new function

# defining the function
str_smoosh <- function(...){
  str_c(..., collapse = "")
}

# testing it out
str_smoosh(c("Super", "cali", "fragilistic", "expi", "ali", "docious"))

[1] "Supercalifragilisticexpialidocious"

Note: str_flatten()

The functionality of str_smoosh() already exists in the {stringr} package as str_flatten(). I think str_smoosh() is a better name though.
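
A quick comparison, assuming {stringr} is installed. The object name pieces is made up for this sketch.

```r
library(stringr)

# the wrapper defined above
str_smoosh <- function(...) {
  str_c(..., collapse = "")
}

# a single character vector to collapse
pieces <- c("Super", "cali", "fragilistic", "expi", "ali", "docious")

# both calls collapse the vector into one string
str_smoosh(pieces)
str_flatten(pieces)
```

One difference worth noting: str_flatten() expects a single character vector, whereas str_smoosh() forwards any number of vectors to str_c() before collapsing.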

In-class exercises

Exercises
Solutions

The function log() takes a numeric vector and returns the natural logarithmic transformation of that vector (i.e., calculating the natural log for each element of the vector). Create a wrapper function called log3 that returns the log base 3 transformation of a given numeric vector. The log() function that you’re wrapping only has two arguments anyway, so you don’t need to use a ... argument.
By default, summarize() removes the last grouping level. Create a wrapper function around summarize() that does not remove any grouping levels. Take a peek at the .groups= argument.

# exercise 4
## defining the function
log3 <- function(x) {
  log(x, base = 3)
}

## testing the function out
log3(c(1, 3, 9, 27, 81)) # should give 0, 1, 2, 3, 4

# exercise 5
## defining the function
summarize_keep <- function(...) {
  summarize(..., .groups = "keep")
}

## testing the function out
### should yield a tibble with 6 groups (and does!)
penguins |>
  drop_na(sex) |>
  group_by(species, sex) |>
  summarize_keep(avg_body_mass = meanNA(body_mass_g))
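
To confirm which grouping levels survive, one option is to inspect the result with dplyr::group_vars(). The sketch below is a minimal check that assumes {dplyr}, {tidyr}, and {palmerpenguins} are installed; the object names grouped, default_result, and keep_result are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

# the wrapper from exercise 5
summarize_keep <- function(...) {
  summarize(..., .groups = "keep")
}

# a tibble grouped by two variables
grouped <- penguins |>
  drop_na(sex) |>
  group_by(species, sex)

# default behavior: the last grouping level (sex) is dropped
default_result <- grouped |>
  summarize(avg_body_mass = mean(body_mass_g, na.rm = TRUE))
group_vars(default_result) # returns "species"

# wrapper behavior: both grouping levels are kept
keep_result <- grouped |>
  summarize_keep(avg_body_mass = mean(body_mass_g, na.rm = TRUE))
group_vars(keep_result) # returns "species" "sex"
```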

Conditional execution

Let’s add some functionality to R’s built-in length() function
I want to have the option to remove NA values before calculating the length of a vector

# defining the function
lengthNA <- function(x, na.rm = FALSE) { # FALSE is the default
  if (na.rm == FALSE) { # if the na.rm object is FALSE
    length(x) # then simply calculate the length
  } else { # otherwise
    length(x[!is.na(x)]) # remove NAs then calculate length
  }
}

# testing the function out
lengthNA(1:5) # should return 5

[1] 5

lengthNA(c(1, 2, 3, NA, 5), na.rm = TRUE) # should return 4

[1] 4

Note: logical indexing

In the user-defined function above, is.na() returns a logical vector that is TRUE where x is NA and FALSE otherwise. I invert this logical vector by calling ! on it (TRUE becomes FALSE and vice versa). I then use the resulting logical vector to index x. Elements in x that are NA will be removed.

# creating an example vector with NA
x <- c(1, 2, 3, NA, 5)

# the result of is.na(x)
is.na(x)

[1] FALSE FALSE FALSE  TRUE FALSE

# the result of !is.na(x)
!is.na(x)

[1]  TRUE  TRUE  TRUE FALSE  TRUE

# the result of logical indexing
x[!is.na(x)]

[1] 1 2 3 5

The tidyverse (specifically {dplyr}) provides us with the if_else() function that can make our code a bit cleaner

lengthNA <- function(x, na.rm = FALSE) {
  if_else(condition = na.rm == FALSE,
          true = length(x),
          false = length(x[!is.na(x)]))
}

Important: if_else() requirements

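The if_else() function can only be used when the outputs of true= and false= are each either length 1 or the same length as the logical vector passed to condition=, and when they share a compatible type. It works in lengthNA() because na.rm == FALSE and both length() results are single values.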
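
The sketch below shows if_else() working element by element on a made-up vector. It assumes {dplyr} is installed. Note that a length-one value passed to true= or false= is recycled to the length of condition=.

```r
library(dplyr)

# an example vector with one missing value
x <- c(1, NA, 3)

# true= and false= are both the same length as the condition
if_else(condition = is.na(x),
        true = rep(0, times = length(x)),
        false = x) # returns 1 0 3

# a length-one true= value is recycled, giving the same result
if_else(condition = is.na(x),
        true = 0,
        false = x) # returns 1 0 3
```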

In-class exercise

Exercise
Solution

Create a function that wraps rnorm and gives the option (using a round= argument) to round the sampled values to two decimal places. The default value of the round= argument is FALSE.

# using base R conditional statements
rnorm_round <- function(..., round = FALSE) {
  if (round == FALSE) {
    rnorm(...)
  } else {
    round(rnorm(...), digits = 2)
  }
}

# testing out the function
rnorm_round(n = 10, mean = 5, sd = 1) # a bunch of long doubles
rnorm_round(n = 10, mean = 5, sd = 1, round = TRUE) # two decimals

Mutating and summarizing functions

If a function takes a vector and returns a vector of the same length (e.g., the log3() function you created) it is “mutating”
If a function takes a vector and returns a single value (e.g., the prop_miss() function we made together) it is “summarizing”
Unsurprisingly, mutating functions work well within mutate(), and summarizing functions work well in summarize()
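
As a sketch of the distinction, the example below applies log3() inside mutate() and prop_miss() inside summarize(). It assumes {dplyr} and {palmerpenguins} are installed; the new column names and the object names with_log3 and miss_by_species are made up for illustration.

```r
library(dplyr)
library(palmerpenguins)

# a mutating function: returns a vector the same length as its input
log3 <- function(x) {
  log(x, base = 3)
}

# a summarizing function: returns a single value
prop_miss <- function(x) {
  num_na <- sum(is.na(x))
  num_na / length(x)
}

# mutate() keeps one row per penguin and adds a new column
with_log3 <- penguins |>
  mutate(log3_body_mass = log3(body_mass_g))

# summarize() collapses each species to a single row
miss_by_species <- penguins |>
  group_by(species) |>
  summarize(prop_miss_sex = prop_miss(sex))
```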

In-class exercises

For each exercise, review the code and (1) describe the utility of the function, (2) classify the function as mutating, summarizing, or something else, and (3) if mutating/summarizing, apply the function to a variable or variables in palmerpenguins::penguins. Assume that arguments x and y are vectors.

Exercises
Solutions

# exercise 7
function_1 <- function(x) {
  x / sum(x, na.rm = TRUE)
}

# exercise 8
function_2 <- function(x, y) {
  ((x - y) / y) * 100
}

# exercise 9
function_3 <- function(x) {
  length(unique(x))
}

# exercise 10
function_4 <- function(x) {
  c(sort(x), rev(sort(x)))
}

Calculates the proportion of each element relative to the sum of all elements. This is a mutating function.

# calculating the proportional contribution that each
# penguin makes to the overall mass of all penguins!
penguins |>
  mutate(prop_body_mass = function_1(body_mass_g))

Calculates the percent change from y to x relative to y. This is a mutating function.

# this is kind of a silly application, but it works
# calculating the percent change in bill length relative
# to bill depth
penguins |>
  mutate(bill_perc_change = function_2(bill_length_mm,
                                       bill_depth_mm))

Calculates the number of unique values in x. This is a summarizing function.

# finding the number of different species on each island
penguins |>
  group_by(island) |>
  summarize(n_species = function_3(species))

Creates a vector containing the values of x in ascending and then descending order. The result is twice the length of x (not counting any NA values, which sort() drops by default), so this function is neither mutating nor summarizing.
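
A quick base R check of this behavior, using made-up input vectors:

```r
# the function from exercise 10
function_4 <- function(x) {
  c(sort(x), rev(sort(x)))
}

# three values in, six values out
function_4(c(3, 1, 2)) # returns 1 2 3 3 2 1
length(function_4(c(3, 1, 2))) # returns 6

# sort() drops NA values by default, so they do not appear in the output
function_4(c(3, NA, 1)) # returns 1 3 3 1
```

Because the output length matches neither the input length nor 1, the function would not fit cleanly in mutate(), which expects one value per row.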