Strings and Spell Casting (regexps)

Sam Mason

Learning goals

Following this lecture, you should be able to:

Write simple regular expressions to match patterns in character vectors
Use regular expressions to transform datasets

Packages

library(tidyverse) # including the {stringr} package

Creating string literals

Big idea: Strings are created with " or '

In R, we create strings by enclosing characters in single (') or double (") quotes. As a useful convention, we default to double quotes.

single_quotes <- 'we already know how to make strings'
double_quotes <- "but it's nice to start with the basics!"

Notice that using double quotes allows us to include the single quote character in our string without issue!

Note: What is a literal?

In programming, a “literal” is a representation of a specific value type in the code itself. Literals are created using specific notation instead of more conventional function syntax. For example, we create a string literal by enclosing characters in quotes; we don’t use a quote() or string() function.

Literals are the things that we use to build vectors from scratch. We worked with several other examples of R literals during our Vectors, Lists, and Tibbles lecture. For example, you might remember that integer literals are notated using an L (e.g., 1L), and that Boolean literals are notated using the reserved names TRUE and FALSE.

Escaping

What happens if our string contains double quotes?
Just enclose the string in single quotes!

'I think it was Plato who once said, "String literals are neat!"'

How about when your string contains both single and double quotes?
In R, " and ' create string literals — this is their primary meaning
We can escape this primary meaning by using the back-slash (\)

"I think it was Gandhi who once said, \"Prof. Mason's lectures are the best!\""

The \ says, “ignore the primary meaning of " in favor of its literal meaning.”
To view the underlying representation of this string literal we can use str_view()

str_view("I think it was Gandhi who once said, \"Prof. Mason's lectures are the best!\"")

[1] │ I think it was Gandhi who once said, "Prof. Mason's lectures are the best!"

Wait, but if \ means “escape”, then how do we represent the literal back-slash character?
Well we need to escape the escape of course!

str_view("directory: C:\\Sam\\Documents\\top_secret\\cute_puppy_photos")

[1] │ directory: C:\Sam\Documents\top_secret\cute_puppy_photos

Other common string escapes

The primary meaning of n is, well, literally the character n
We can escape this primary meaning as \n to mean “new line”
The primary meaning of t is, again, literally the character t
We can escape this primary meaning as \t to mean “tab”
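
A minimal sketch of these escapes in action, using only base R. The example string is made up for illustration; print() shows the escaped form as typed, while writeLines() shows the underlying characters.

```r
# A hypothetical string containing a tab escape and a new line escape
x <- "name:\tSam\nrole:\tlecturer"

print(x)       # shows the escaped form, as typed in the code
writeLines(x)  # shows the underlying characters (tab and line break rendered)

# Each escape sequence is a single character in the string value
nchar("\n")  # 1
nchar("\t")  # 1
nchar("\\")  # 1 (an escaped back-slash is one back-slash)
```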

Tip: New lines in Windows

You may come across strings containing \r\n. This is the way that “new line” is encoded on Windows platforms.

In-class exercises

You can use the R editor below to complete the following exercises. The editor does not save your work. If you’d prefer, you can also complete exercises in Posit Cloud.

Exercises
Solutions

Create string literals that represent the following values. Use str_view() to check your work.
1. He said "That's amazing!"
2. \a\b\c\d
3. \\\\\\ (lol)

# exercise 1
# part a
str_view("He said \"That's amazing!\"")

# part b
str_view("\\a\\b\\c\\d")

# part c
str_view("\\\\\\\\\\\\")

Intro to regular expressions

When we first learned to filter() we used equality (==) to subset rows based on the identity of values

mpg |> filter(manufacturer == "audi")
mpg |> filter(cyl == 4)

We next learned about comparison operators, which allow us to subset based on the characteristics of values

mpg |> filter(hwy >= 25) # based on the magnitude of the number
mpg |> filter(model < "m") # based on location in alphabet (neat!)

We can easily imagine many additional characteristics of numbers by which to filter
- All numbers divisible by 3
- All prime numbers
- All odd numbers
- All negative numbers
It can, at first, be difficult to come up with a similar list for strings
Once we get started, however, the list of possible string characteristics will seem unending
- The length of the string
- The case of the characters in the string
- The presence of number characters in the string
- The presence of special characters in the string
- The presence of whitespace in the string
- The characters at the start of the string
- The characters at the end of the string
- The characters not in the string
- The number of repeated characters
- The order of the characters in the string
Regular expressions (regexs; regexps) give us the ability to describe these (often) complex string patterns in a concise way

Casting spells?

Big idea: Pattern matching

We learn how to write regular expressions so that we can find patterns among the characters of strings. Once we’ve described a pattern using a regular expression, we can do all sorts of useful things including:

Detect patterns in strings to filter() using str_detect()
Remove matched characters using str_remove()
Extract matched characters to mutate() using str_extract()

More on these functions at the end of the lecture.
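
As a quick preview, here is a minimal sketch of the three functions applied to plain literal patterns. The desserts vector is hypothetical example data, not part of the lecture's datasets.

```r
library(stringr)

# Hypothetical example data
desserts <- c("apple pie", "banana split", "cherry pie")

# Does each string contain the pattern?
str_detect(desserts, pattern = "pie")   # TRUE FALSE TRUE

# Remove the first match from each string
str_remove(desserts, pattern = " pie")  # "apple" "banana split" "cherry"

# Pull out the first match (NA where there is none)
str_extract(desserts, pattern = "an")   # NA "an" NA
```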

At first (and often for much longer), regular expressions will seem incredibly unintuitive and somewhat magical (🧙‍♂️)
Here’s an example of a fairly commonly applied type of regular expression

^(?=.*[A-Za-z])(?=.*\d)(?=.*[@$!%*#?&])[A-Za-z\d@$!%*#?&]{8,}$

Any guesses as to what character pattern this matches?
Used to check password requirements — matches to strings with:
- At least 8 characters
- At least one letter
- At least one number
- At least one special character
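
To see the spell at work, here is a minimal sketch that tests a few made-up candidate passwords against this pattern with str_detect(). The candidates vector is hypothetical. Note that each \ in the regular expression is written as \\ inside the R string literal.

```r
library(stringr)

# The password pattern, written as an R string literal
password_rule <- "^(?=.*[A-Za-z])(?=.*\\d)(?=.*[@$!%*#?&])[A-Za-z\\d@$!%*#?&]{8,}$"

# Hypothetical candidates: only the first satisfies every requirement
candidates <- c(
  "abcdefg1!",  # letter, number, special character, 9 characters
  "abc12345",   # no special character
  "password!",  # no number
  "a1!"         # too short
)

str_detect(candidates, pattern = password_rule)  # TRUE FALSE FALSE FALSE
```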

Basic pattern matching

Note: The fruit vector

We’ll be using the stringr::fruit character vector throughout the following slides. We’ve met this example data before, but here again is what it looks like:

str_glue("There are {length(fruit)} fruit name strings")

There are 80 fruit name strings

fruit[1:5]

[1] "apple"       "apricot"     "avocado"     "banana"      "bell pepper"

The simplest patterns to match to are just sequences of literal characters
For these basic patterns, regular expressions look just like strings (phew!)
For example, let’s find all strings from fruit that contain the letter "q"

# str_view() is again a helpful function for us here
str_view(fruit, pattern = "q")

[43] │ kum<q>uat
[46] │ lo<q>uat
[67] │ <q>uince

How about all the fruits that have "berry" in their name?

str_view(fruit, pattern = "berry")

 [6] │ bil<berry>
 [7] │ black<berry>
[10] │ blue<berry>
[11] │ boysen<berry>
[19] │ cloud<berry>
[21] │ cran<berry>
[29] │ elder<berry>
[32] │ goji <berry>
[33] │ goose<berry>
[38] │ huckle<berry>
[50] │ mul<berry>
[70] │ rasp<berry>
[73] │ salal <berry>
[76] │ straw<berry>

The pattern matching pipeline

In the last slide, we wrote the following regular expression: "berry"
Let’s track "berry" on its journey from string literal in R code to the character pattern it matches to

	Stage	Representation
1	String literal in R code	`"berry"`
2	String value/regular expression	`berry`
3	Match	`berry`

We give R the literal string "berry"
This literal is represented as the string value berry
Which is interpreted as the regular expression berry
Which is then evaluated to match the character pattern berry

Literals and metacharacters

All of the characters in "berry" are literals — they match to exactly the characters that they are
Certain other characters, called metacharacters, have non-literal meanings
To illustrate, let’s find all fruit names starting with “a”

str_view(fruit, pattern = "^a")

[1] │ <a>pple
[2] │ <a>pricot
[3] │ <a>vocado

^ is a metacharacter that anchors a pattern at the start of a string
We can find all names that end in "n" using a different anchor, $

str_view(fruit, pattern = "n$")

[13] │ canary melo<n>
[24] │ damso<n>
[27] │ duria<n>
[44] │ lemo<n>
[60] │ persimmo<n>
[66] │ purple mangostee<n>
[68] │ raisi<n>
[69] │ rambuta<n>
[72] │ rock melo<n>
[80] │ watermelo<n>

Tip: Remembering ^ and $

Here’s a useful mnemonic to help you remember that ^ (the exponentiation or “power” symbol) is the start anchor, and $ the end: if you start with power, you end with money. A bleak but accurate commentary on our world…

	Stage	Representation
1	String literal in R code	`"n$"`
2	String value/regular expression	`n$`
3	Match	`n` at the end

Escaping in regexp-land

We’ve seen that we can use \ to escape primary meanings in string-land
How do we escape in regexp-land, and why would we need to?

prices <- c("1,642", "$523", "$2,007")

I’d like to write a regular expression that matches to the $ character

str_view(prices, pattern = "$")

[1] │ 1,642<>
[2] │ $523<>
[3] │ $2,007<>

This attempt does not work, of course, because $ is a metacharacter: it matches the end of every string

	Stage	Representation
1	String literal in R code	`"$"`
2	String value/regular expression	`$`
3	Match	the end of the string

To escape $’s primary meaning, let’s give \ a try!

str_view(prices, pattern = "\$")

Error: '\$' is an unrecognized escape in character string (<text>:1:30)

Error?! Remember, we go from R-land (string literal) to string-land
In string-land, $ has no secondary meaning to escape into, so we get an error
We need to escape the escape!

str_view(prices, pattern = "\\$")

[2] │ <$>523
[3] │ <$>2,007

	Stage	Representation
1	String literal in R code	`"\\$"`
2	String value/regular expression	`\$`
3	Match	`$`
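
The same journey can be checked in code. This is a minimal sketch: writeLines() shows the string value that R hands over as the regular expression, and fixed() is a {stringr} helper that treats the whole pattern as literal characters, shown here as an alternative rather than as the approach used in this lecture.

```r
library(stringr)

prices <- c("1,642", "$523", "$2,007")

# Stage 2: the string value behind the literal "\\$" is \$
writeLines("\\$")

# Stage 3: the regular expression \$ matches the literal $ character
str_detect(prices, pattern = "\\$")        # FALSE TRUE TRUE

# Alternative: fixed() skips regexp interpretation, so no escaping is needed
str_detect(prices, pattern = fixed("$"))   # FALSE TRUE TRUE
```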

One character to match them all

The . metacharacter matches to all characters except the newline character \n
Find all fruit names that end with a, followed by any single character, followed by e

str_view(fruit, pattern = "a.e$")

[25] │ d<ate>
[34] │ gr<ape>
[64] │ pomegran<ate>

If we wanted to match to the literal . character, we’d need to double escape
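
A minimal sketch of the difference, using a small made-up vector (the strings are hypothetical):

```r
library(stringr)

# Hypothetical example data
x <- c("a.c", "abc")

# Unescaped, . matches any character, so both strings match
str_detect(x, pattern = "a.c")    # TRUE TRUE

# "\\." is the string value \. which matches only the literal . character
str_detect(x, pattern = "a\\.c")  # TRUE FALSE
```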

In-class exercises

You can use the R editor below to complete the following exercises. The editor does not save your work. If you’d prefer, you can also complete exercises in Posit Cloud.

Exercises
Solutions

Find all strings in fruit that start with bl
Using the three metacharacters we’ve talked about so far (^, $, and .), find all the fruits that are exactly seven characters long
In the numbers vector created above, find all decimal place characters
What character pattern does the string literal "\\\\" match to?

# exercise 2
str_view(fruit, pattern = "^bl")

# exercise 3
str_view(fruit, pattern = "^.......$")

# exercise 4
str_view(numbers, pattern = "\\.")

# exercise 5
# it matches to the literal \ character
# "\\\\" becomes \\ (two \ get escaped in string-land)
# and \\ (an escaped escape in regexp-land) matches to \
str_view(string = "\\", pattern = "\\\\")

Repeating patterns

We can use a more concise regexp to solve exercise 3 on the previous slide
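
One way to write the more concise version is sketched below, on the assumption that the exact-repeat quantifier {n} is what is intended here: .{7} stands for "any character, exactly seven times".

```r
library(stringr)

# Seven dots in a row can be written as .{7}
str_view(fruit, pattern = "^.{7}$")

# Sanity check: both patterns select the same fruit names
identical(
  str_subset(fruit, pattern = "^.{7}$"),
  str_subset(fruit, pattern = "^.......$")
)
```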

We might also find it useful to find all names with at least 7 characters

str_view(fruit, pattern = "^.{7,}$")

 [2] │ <apricot>
 [3] │ <avocado>
 [5] │ <bell pepper>
 [6] │ <bilberry>
 [7] │ <blackberry>
 [8] │ <blackcurrant>
 [9] │ <blood orange>
[10] │ <blueberry>
[11] │ <boysenberry>
[12] │ <breadfruit>
[13] │ <canary melon>
[14] │ <cantaloupe>
[15] │ <cherimoya>
[17] │ <chili pepper>
[18] │ <clementine>
[19] │ <cloudberry>
[20] │ <coconut>
[21] │ <cranberry>
[22] │ <cucumber>
[23] │ <currant>
... and 32 more

Or perhaps between 3 and 5 characters

str_view(fruit, pattern = "^.{3,5}$")

 [1] │ <apple>
[25] │ <date>
[31] │ <fig>
[34] │ <grape>
[36] │ <guava>
[44] │ <lemon>
[45] │ <lime>
[49] │ <mango>
[52] │ <nut>
[53] │ <olive>
[58] │ <peach>
[59] │ <pear>
[63] │ <plum>

Quantifiers with options

Note: The words and sentences vectors

The {stringr} package includes two other example data objects called words and sentences.

str_glue("There are {length(words)} word strings")

There are 980 word strings

words[1:5]

[1] "a"        "able"     "about"    "absolute" "accept"

str_glue("There are {length(sentences)} sentence strings")

There are 720 sentence strings

sentences[1:5]

[1] "The birch canoe slid on the smooth planks." 
[2] "Glue the sheet to the dark blue background."
[3] "It's easy to tell the depth of a well."     
[4] "These days a chicken leg is a rare dish."   
[5] "Rice is often served in round bowls."

Make a pattern optional (zero or one) with ?
Allow a pattern to repeat one or more times with +
Allow a pattern to repeat zero or more times with *

# Match to all sentences containing "boy" or "boys"
str_view(sentences, pattern = "boys?")

 [11] │ The <boy> was there when the sun rose.
 [25] │ The beauty of the view stunned the young <boy>.
[423] │ The <boy> owed his pal thirty cents.
[591] │ Stop whistling and watch the <boys> march.
[634] │ It was done before the <boy> could see it.
[663] │ The poor <boy> missed the boat again.
[708] │ He sent the <boy> on a short errand.

# Match to words starting with a, containing one or more of any
# character, and ending with a
str_view(words, pattern = "^a.+a$")

[36] │ <america>
[49] │ <area>

# Match to all words starting with "p", followed by either nothing
# or any number of characters, and ending with "er"
str_subset(words, pattern = "^p.*er$")

[1] "paper"  "per"    "power"  "proper"

Note: Parentheses in regular expressions

Quantifiers only modify the character or metacharacter immediately preceding them. So in the regular expression boys?, the ? quantifier only affects s. If we wanted this quantifier to modify y and s together, we’d need to group them with parentheses like bo(ys)?.
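
A minimal sketch of how grouping changes what ? applies to, using a small made-up vector:

```r
library(stringr)

# Hypothetical example data
x <- c("bo", "boy", "boys")

# ? applies only to s, so "bo" alone does not match
str_extract(x, pattern = "boys?")    # NA "boy" "boys"

# ? applies to the group ys, so "bo" matches on its own
# and "boy" matches only its first two characters
str_extract(x, pattern = "bo(ys)?")  # "bo" "bo" "boys"
```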

Alternates

I’d like to find all fruits that start with a vowel
Just like in R, | means “or” in regexp-land
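
A minimal sketch of the alternation approach, assuming the vowels are grouped with parentheses so that the ^ anchor applies to every alternative and not just the first one:

```r
library(stringr)

# Start of string, followed by a or e or i or o or u
str_view(fruit, pattern = "^(a|e|i|o|u)")

# Without the parentheses, ^ only anchors the first alternative,
# so any string containing e, i, o, or u anywhere would match
str_detect("cherry", pattern = "^a|e|i|o|u")    # TRUE
str_detect("cherry", pattern = "^(a|e|i|o|u)")  # FALSE
```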

We can accomplish this same pattern match a bit more succinctly using a character class

str_view(fruit, pattern = "^[aeiou]")

 [1] │ <a>pple
 [2] │ <a>pricot
 [3] │ <a>vocado
[28] │ <e>ggplant
[29] │ <e>lderberry
[53] │ <o>live
[54] │ <o>range
[79] │ <u>gli fruit

Character classes become even more useful when we have a set of characters that we don’t want to match to
Within [], ^ means “anything but”

# Match to any fruit name starting with anything but a, e, i, o, or u
str_view(fruit, pattern = "^[^aeiou]")

 [4] │ <b>anana
 [5] │ <b>ell pepper
 [6] │ <b>ilberry
 [7] │ <b>lackberry
 [8] │ <b>lackcurrant
 [9] │ <b>lood orange
[10] │ <b>lueberry
[11] │ <b>oysenberry
[12] │ <b>readfruit
[13] │ <c>anary melon
[14] │ <c>antaloupe
[15] │ <c>herimoya
[16] │ <c>herry
[17] │ <c>hili pepper
[18] │ <c>lementine
[19] │ <c>loudberry
[20] │ <c>oconut
[21] │ <c>ranberry
[22] │ <c>ucumber
[23] │ <c>urrant
... and 52 more

Character classes also support sequence (range) syntax
If we wanted to match to any lowercase letter we could define the class [a-z], for example

passwords <- c("password123", "QWERTY321", "p@$$word!!")

# any letter (upper or lower) followed by any digit
str_view(passwords, pattern = "[a-zA-Z][0-9]")

[1] │ passwor<d1>23
[2] │ QWERT<Y3>21

In-class exercises

You can use the R editor below to complete the following exercises. The editor does not save your work. If you’d prefer, you can also complete exercises in Posit Cloud.

Exercises
Solutions

Find all strings in words that contain any two vowels next to each other
In the numbers vector defined above, elements 1, 3, and 5 are valid 10-digit phone numbers. Write a regular expression that will match to only these strings.
Describe in words what these regular expressions match: (read carefully to see if each entry is a regular expression or a string literal that defines a regular expression.)
1. ^.*$
2. "\\{.+\\}"
3. "\\\\{4}"

# exercise 6
str_view(words, pattern = "[aeiou]{2}")

# exercise 7
str_view(numbers, pattern = "^[0-9]{3}.[0-9]{3}.[0-9]{4}$")

# exercise 8
# part a
# start of string, zero or more characters, end of string

# part b
# the literal { character followed by any character one or more times followed by the literal } character

# part c
# the literal \ character four times in a row

The {stringr} cheatsheet

If you’re anything like me, you’re going to have a hard time remembering all of this regexp syntax
Luckily, there is an incredibly handy cheatsheet that you can reference. The second page contains all the useful regexp info.
I’ve been casting spells (so to speak) for a while now and still need to consult the mystic tome (I’m enjoying this metaphor) quite frequently.

In-class exercises

You can use the R editor below to complete the following exercises. The editor does not save your work. If you’d prefer, you can also complete exercises in Posit Cloud.

Exercises
Solutions

The following exercises relate to a new regular expression concept called “look-arounds”. Use the {stringr} cheatsheet (the material related to look-arounds is near the bottom of the second page) to complete these exercises. These exercises use the words vector.

Match to any character preceded by a vowel using a look-around. str_view_all() will show multiple matches within each string.
Match to all c, s, t and w characters not followed by an h using a look-around. str_view_all() will show multiple matches within each string.

# exercise 9
str_view_all(words, "(?<=[aeiou]).")

 [1] │ a
 [2] │ a<b>le
 [3] │ a<b>o<u><t>
 [4] │ a<b>so<l>u<t>e
 [5] │ a<c>ce<p>t
 [6] │ a<c>co<u><n>t
 [7] │ a<c>hi<e><v>e
 [8] │ a<c>ro<s>s
 [9] │ a<c>t
[10] │ a<c>ti<v>e
[11] │ a<c>tu<a><l>
[12] │ a<d>d
[13] │ a<d>dre<s>s
[14] │ a<d>mi<t>
[15] │ a<d>ve<r>ti<s>e
[16] │ a<f>fe<c>t
[17] │ a<f>fo<r>d
[18] │ a<f>te<r>
[19] │ a<f>te<r>no<o><n>
[20] │ a<g>a<i><n>
... and 960 more

# exercise 10
str_view_all(words, "[cstw](?!h)")

 [1] │ a
 [2] │ able
 [3] │ abou<t>
 [4] │ ab<s>olu<t>e
 [5] │ a<c><c>ep<t>
 [6] │ a<c><c>oun<t>
 [7] │ achieve
 [8] │ a<c>ro<s><s>
 [9] │ a<c><t>
[10] │ a<c><t>ive
[11] │ a<c><t>ual
[12] │ add
[13] │ addre<s><s>
[14] │ admi<t>
[15] │ adver<t>i<s>e
[16] │ affe<c><t>
[17] │ afford
[18] │ af<t>er
[19] │ af<t>ernoon
[20] │ again
... and 960 more

Data transformation with regexps

Regular expressions can help us transform datasets when combined with key {stringr} functions.
- Filtering with str_detect()
- Cleaning character columns with str_replace() and str_remove()
- Creating new columns with str_extract()
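
As a quick illustration of the cleaning case, here is a minimal sketch that reuses the prices vector from earlier in the lecture. str_remove_all() is the {stringr} variant that removes every match rather than just the first; the "USD " replacement text is made up for illustration.

```r
library(stringr)

prices <- c("1,642", "$523", "$2,007")

# Remove every $ and , so the strings can be converted to numbers
cleaned <- str_remove_all(prices, pattern = "[$,]")
as.numeric(cleaned)  # 1642 523 2007

# Replace a leading $ with a (hypothetical) currency label
str_replace(prices, pattern = "^\\$", replacement = "USD ")
# "1,642" "USD 523" "USD 2,007"
```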

starwars

# A tibble: 87 × 14
   name       height  mass hair_color skin_color eye_color birth_year sex  
   <chr>       <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
 1 Luke Skyw…    172    77 blond      fair       blue            19   male 
 2 C-3PO         167    75 <NA>       gold       yellow         112   none 
 3 R2-D2          96    32 <NA>       white, bl… red             33   none 
 4 Darth Vad…    202   136 none       white      yellow          41.9 male 
 5 Leia Orga…    150    49 brown      light      brown           19   fema…
 6 Owen Lars     178   120 brown, gr… light      blue            52   male 
 7 Beru Whit…    165    75 brown      light      blue            47   fema…
 8 R5-D4          97    32 <NA>       white, red red             NA   none 
 9 Biggs Dar…    183    84 black      light      brown           24   male 
10 Obi-Wan K…    182    77 auburn, w… fair       blue-gray       57   male 
# ℹ 77 more rows
# ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
#   films <list>, vehicles <list>, starships <list>

Filtering

# filtering for all characters with droid names
starwars |>
  filter(str_detect(name, pattern = "^[A-Z0-9]+-[A-Z0-9]+$"))

# A tibble: 5 × 14
  name   height  mass hair_color skin_color  eye_color birth_year sex  
  <chr>   <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr>
1 C-3PO     167    75 <NA>       gold        yellow           112 none 
2 R2-D2      96    32 <NA>       white, blue red               33 none 
3 R5-D4      97    32 <NA>       white, red  red               NA none 
4 IG-88     200   140 none       metal       red               15 none 
5 R4-P17     96    NA none       silver, red red, blue         NA none 
# ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
#   films <list>, vehicles <list>, starships <list>

Mutating

# creating a new column that holds each film's acronym
starwars |>
  unnest_longer(col = films) |>
  # str_extract_all() returns a list of character vectors (one per string)
  mutate(film_acronym = str_extract_all(films,
                                        pattern = "\\b[a-zA-Z]"),
         # str_flatten() combines vector into single string
         film_acronym = map_chr(.x = film_acronym,
                                .f = \(x) str_flatten(x))) |>
  select(name, species, films, film_acronym)

# A tibble: 173 × 4
   name           species films                   film_acronym
   <chr>          <chr>   <chr>                   <chr>       
 1 Luke Skywalker Human   A New Hope              ANH         
 2 Luke Skywalker Human   The Empire Strikes Back TESB        
 3 Luke Skywalker Human   Return of the Jedi      RotJ        
 4 Luke Skywalker Human   Revenge of the Sith     RotS        
 5 Luke Skywalker Human   The Force Awakens       TFA         
 6 C-3PO          Droid   A New Hope              ANH         
 7 C-3PO          Droid   The Empire Strikes Back TESB        
 8 C-3PO          Droid   Return of the Jedi      RotJ        
 9 C-3PO          Droid   The Phantom Menace      TPM         
10 C-3PO          Droid   Attack of the Clones    AotC        
# ℹ 163 more rows
