Following this lecture, you should be able to:
Write simple regular expressions to match patterns in character vectors
Use regular expressions to transform datasets
Big idea: Strings are created with " or '
In R, we create strings by enclosing characters in single (') or double (") quotes. As a useful convention, we default to double quotes.
Note: What is a literal?
In programming, a “literal” is a representation of a specific value type in the code itself. Literals are created using specific notation instead of more conventional function syntax. For example, we create a string literal by enclosing characters in quotes; we don’t use a quote() or string() function.
Literals are the things that we use to build vectors from scratch. We worked with several other examples of R literals during our Vectors, Lists, and Tibbles lecture. For example, you might remember that integer literals are notated using an L (e.g., 1L), and that Boolean literals are notated using the reserved names TRUE and FALSE.
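As a small sketch of the literals mentioned in this note (the variable names are made up for illustration), we can write each one directly and ask R what type it created:

```r
# Each value below is written as a literal: no constructor function is called.
int_value <- 1L        # integer literal (note the L suffix)
dbl_value <- 1         # without the L, R creates a double
lgl_value <- TRUE      # logical (Boolean) literal
chr_value <- "apple"   # string literal

typeof(int_value)
typeof(dbl_value)
typeof(lgl_value)
typeof(chr_value)
```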
What happens if our string contains double quotes?
Just enclose the string in single quotes!
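A minimal sketch of this idea, using made-up example strings. The same trick works in reverse for a string that contains an apostrophe:

```r
# Double quotes inside the string, so we wrap it in single quotes
single_wrapped <- 'She said, "regular expressions are fun!"'

# An apostrophe inside the string, so we wrap it in double quotes
double_wrapped <- "It's a string"

# writeLines() prints the string value itself
writeLines(single_wrapped)
writeLines(double_wrapped)
```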
How about when your string contains both single and double quotes?
In R, " and ' create string literals; this is their primary meaning
We can escape this primary meaning by using the back-slash (\)
The \ says, “ignore the primary meaning of " in favor of its literal meaning.”
To view the underlying representation of this string literal we can use str_view()
[1] │ I think it was Gandhi who once said, "Prof. Masons's lectures are the best!"
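Here is a small sketch of escaping quotes, using a made-up string and base R's writeLines() and print() rather than str_view(). The back-slashes appear in the code and in the printed representation, but they are not part of the string value:

```r
# A string containing both kinds of quotes: the inner " must be escaped
both_quotes <- "She said, \"it's escaped!\""

# print() shows the escaped representation, as we would type it in code
print(both_quotes)

# writeLines() shows the underlying string value, without the back-slashes
writeLines(both_quotes)

# The back-slashes are not counted as characters
nchar(both_quotes)
```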
Wait, but if \ means “escape”, then how do we represent the literal back-slash character?
Well we need to escape the escape of course!
The primary meaning of n is, well, literally the character n
We can escape this primary meaning as \n to mean “new line”
The primary meaning of t is, again, literally the character t
We can escape this primary meaning as \t to mean “tab”
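A short sketch of these three escapes, with made-up strings. Each escape sequence takes two characters to type but counts as a single character in the string value:

```r
backslash <- "\\"             # an escaped escape: one literal back-slash
two_lines <- "first\nsecond"  # \n is a single new line character
tabbed    <- "left\tright"    # \t is a single tab character

nchar(backslash)

# writeLines() renders the escapes instead of showing them
writeLines(backslash)
writeLines(two_lines)
writeLines(tabbed)
```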
Tip: New lines in Windows
You may come across strings containing \r\n. This is the way that “new line” is encoded on Windows platforms.
You can use the R editor below to complete the following exercises. The editor does not save your work. If you’d prefer, you can also complete exercises in Posit Cloud.
With filter(), we used equality (==) to subset rows based on the identity of values
We can easily imagine many additional characteristics of numbers by which to filter
All numbers divisible by 3
All prime numbers
All odd numbers
All negative numbers
It can, at first, be difficult to come up with a similar list for strings
Once we get started, however, the list of possible string characteristics will seem unending
The length of the string
The case of the characters in the string
The presence of number characters in the string
The presence of special characters in the string
The presence of whitespace in the string
The characters at the start of the string
The characters at the end of the string
The characters not in the string
The number of repeated characters
The order of the characters in the string
Regular expressions (regexes; regexps) give us the ability to describe these (often) complex string patterns in a concise way
Big idea: Pattern matching
We learn how to write regular expressions so that we can find patterns among the characters of strings. Once we’ve described a pattern using a regular expression, we can do all sorts of useful things including:
Detect patterns in strings to filter() using str_detect()
Remove matched characters using str_remove()
Extract matched characters to mutate() using str_extract()
More on these functions at the end of the lecture.
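As a quick preview, here is a sketch of the three functions applied to a made-up character vector (snacks is hypothetical and not part of the lecture data). The patterns are plain sequences of characters, so no special syntax is needed yet:

```r
library(stringr)

# Hypothetical example data
snacks <- c("apple pie", "banana bread", "cherry tart")

# Does each string contain "an"? Returns a logical vector
detected <- str_detect(snacks, pattern = "an")

# Remove the first "a" found in each string
removed <- str_remove(snacks, pattern = "a")

# Pull out the matched characters, or NA when there is no match
extracted <- str_extract(snacks, pattern = "rr")

detected
removed
extracted
```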
At first (and often for much longer), regular expressions will seem incredibly unintuitive and somewhat magical (🧙‍♀️)
Here’s an example of a fairly commonly applied type of regular expression
^(?=.*[A-Za-z])(?=.*\d)(?=.*[@$!%*#?&])[A-Za-z\d@$!%*#?&]{8,}$
Any guesses as to what character pattern this matches?
It’s used to check password requirements; it matches strings with:
At least 8 characters
At least one letter
At least one number
At least one special character
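To see this pattern in action, here is a minimal sketch that runs it against a few made-up candidate passwords with str_detect(). Note that each \d from the regular expression has to be typed as \\d inside the R string literal:

```r
library(stringr)

# The password pattern, written as an R string literal
password_pattern <-
  "^(?=.*[A-Za-z])(?=.*\\d)(?=.*[@$!%*#?&])[A-Za-z\\d@$!%*#?&]{8,}$"

# Hypothetical candidates
candidates <- c(
  "abc12345!",   # letter, number, special character, 9 characters
  "password1",   # no special character
  "ab1!"         # too short
)

is_valid <- str_detect(candidates, pattern = password_pattern)
is_valid
```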
Note: The fruit vector
We’ll be using the stringr::fruit character vector throughout the following slides. We’ve met this example data before, but here again is what it looks like:
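One way to take a quick look at the vector is sketched here with head() and length(); showing six elements is an arbitrary choice:

```r
library(stringr)

# The first few fruit names
first_fruits <- head(fruit)
first_fruits

# How many fruit names are there in total?
length(fruit)
```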
The simplest patterns to match to are just sequences of literal characters
For these basic patterns, regular expressions look just like strings (phew!)
For example, let’s find all strings from fruit that contain the letter "q"
[43] │ kum<q>uat
[46] │ lo<q>uat
[67] │ <q>uince
Or all fruits with "berry" in the name
[6] │ bil<berry>
[7] │ black<berry>
[10] │ blue<berry>
[11] │ boysen<berry>
[19] │ cloud<berry>
[21] │ cran<berry>
[29] │ elder<berry>
[32] │ goji <berry>
[33] │ goose<berry>
[38] │ huckle<berry>
[50] │ mul<berry>
[70] │ rasp<berry>
[73] │ salal <berry>
[76] │ straw<berry>
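Output like the above can be produced with calls along these lines. This is a sketch: str_view() highlights the matches, and str_subset() is used here as a convenient way to keep just the matching strings:

```r
library(stringr)

# Highlight the literal pattern wherever it occurs
str_view(fruit, pattern = "q")
str_view(fruit, pattern = "berry")

# Keep only the strings that contain the pattern
q_fruits     <- str_subset(fruit, pattern = "q")
berry_fruits <- str_subset(fruit, pattern = "berry")

q_fruits
berry_fruits
```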
In the last slide, we wrote the following regular expression: "berry"
Let’s track "berry" on its journey from string literal in R code to the character pattern it matches to
Stage | Representation | Example |
---|---|---|
1 | String literal in R code | "berry" |
2 | String value/regular expression | berry |
3 | Match | berry |
We give R the literal string "berry"
This literal is represented as the string value berry
Which is interpreted as the regular expression berry
Which is then evaluated to match the character pattern berry
All of the characters in "berry" are literals; they match to exactly the characters that they are
Certain other characters, called metacharacters, have non-literal meanings
To illustrate, let’s find all fruit names starting with “a”
^ is a metacharacter that anchors a pattern at the start of a string
We can find all names that end in "n" using a different anchor, $
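A short sketch of both anchors on the fruit vector. Again, str_subset() is used to keep the matching strings so the results are easy to inspect:

```r
library(stringr)

# ^ anchors the pattern at the start of each string
str_view(fruit, pattern = "^a")
starts_with_a <- str_subset(fruit, pattern = "^a")

# $ anchors the pattern at the end of each string
str_view(fruit, pattern = "n$")
ends_with_n <- str_subset(fruit, pattern = "n$")

starts_with_a
ends_with_n
```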
Tip: Remembering ^ and $
Here’s a useful mnemonic to help you remember that ^ (the exponentiation or “power” symbol) is the start anchor, and $ the end: if you start with power, you end with money. A bleak but accurate commentary on our world…
Stage | Representation | Example |
---|---|---|
1 | String literal in R code | "n$" |
2 | String value/regular expression | n$ |
3 | Match | n at the end |
We’ve seen that we can use \ to escape primary meanings in string-land
How do we escape in regexp-land, and why would we need to?
Suppose we want to match the literal $ character
$ is a metacharacter, so it’ll match to the end of all strings
Stage | Representation | Example |
---|---|---|
1 | String literal in R code | "$" |
2 | String value/regular expression | $ |
3 | Match | the end of the string |
To escape $’s primary meaning, let’s give \ a try!
Error: '\$' is an unrecognized escape in character string (<text>:1:30)
Error?! Remember, we go from R-land (string literal) to string-land
In string-land, $
has no secondary meaning to escape into, so we get an error
We need to escape the escape!
Stage | Representation | Example |
---|---|---|
1 | String literal in R code | "\\$" |
2 | String value/regular expression | \$ |
3 | Match | $ |
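Here is a minimal sketch of the double escape at work, using a made-up prices vector. The unescaped "$" matches the end of every string, while "\\$" matches only strings that contain a literal dollar sign:

```r
library(stringr)

# Hypothetical example data
prices <- c("$5", "5 dollars", "cost: $20")

# "$" is the end anchor, so every string has a match
anchor_match <- str_detect(prices, pattern = "$")

# "\\$" becomes \$ in string-land, which matches a literal $ in regexp-land
dollar_match <- str_detect(prices, pattern = "\\$")

anchor_match
dollar_match
```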
The . metacharacter matches to all characters except the newline character \n
Find all fruit names that end with a, then any character, then e
To match to the literal . character, we’d need to double escape
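As a sketch of the . wildcard and its escaped form, compare the two patterns on a small made-up vector. The fruit pattern "a.e$" is one reading of "ends with a, any character, e":

```r
library(stringr)

# Hypothetical example data
x <- c("a.c", "abc", "ac")

# . matches any single character, so both "a.c" and "abc" match
wildcard_match <- str_detect(x, pattern = "a.c")

# \\. becomes \. in string-land, which matches only a literal .
literal_match <- str_detect(x, pattern = "a\\.c")

wildcard_match
literal_match

# Fruit names ending with a, any character, e
str_subset(fruit, pattern = "a.e$")
```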
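Exercise 2: Find all fruits in fruit that start with bl
Exercise 3: Using only the metacharacters covered so far (^, $, and .), find all the fruits that are exactly seven characters long
Exercise 4: In the numbers vector created above, find all decimal place characters
Exercise 5: What does the regular expression "\\\\" match to?
# exercise 2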
str_view(fruit, pattern = "^bl")
# exercise 3
str_view(fruit, pattern = "^.......$")
# exercise 4
str_view(numbers, pattern = "\\.")
# exercise 5
# it matches to the literal \ character
# "\\\\" becomes \\ (two \ get escaped in string-land)
# and \\ (an escaped escape in regexp-land) matches to \
str_view(string = "\\", pattern = "\\\\")
[2] │ <apricot>
[3] │ <avocado>
[5] │ <bell pepper>
[6] │ <bilberry>
[7] │ <blackberry>
[8] │ <blackcurrant>
[9] │ <blood orange>
[10] │ <blueberry>
[11] │ <boysenberry>
[12] │ <breadfruit>
[13] │ <canary melon>
[14] │ <cantaloupe>
[15] │ <cherimoya>
[17] │ <chili pepper>
[18] │ <clementine>
[19] │ <cloudberry>
[20] │ <coconut>
[21] │ <cranberry>
[22] │ <cucumber>
[23] │ <currant>
... and 32 more
Note: The words and sentences vectors
The {stringr} package includes two other example data objects called words and sentences.
There are 980 word strings
[1] "a" "able" "about" "absolute" "accept"
There are 720 sentence strings
[1] "The birch canoe slid on the smooth planks."
[2] "Glue the sheet to the dark blue background."
[3] "It's easy to tell the depth of a well."
[4] "These days a chicken leg is a rare dish."
[5] "Rice is often served in round bowls."
Make a pattern optional (zero or one) with ?
Allow a pattern to repeat one or more times with +
Make a pattern optional or allow it to repeat one or more times with *
[11] │ The <boy> was there when the sun rose.
[25] │ The beauty of the view stunned the young <boy>.
[423] │ The <boy> owed his pal thirty cents.
[591] │ Stop whistling and watch the <boys> march.
[634] │ It was done before the <boy> could see it.
[663] │ The poor <boy> missed the boat again.
[708] │ He sent the <boy> on a short errand.
Note: Parentheses in regular expressions
Quantifiers only modify the character or metacharacter immediately preceding them. So in the regular expression boys?, the ? quantifier only affects s. If we wanted this quantifier to modify y and s together, we’d need to group them with parentheses like bo(ys)?.
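The quantifiers and the grouping rule can be sketched on a couple of small made-up vectors. The anchors in the grouped pattern are added here so that the whole string has to match:

```r
library(stringr)

# Hypothetical example data
x <- c("bo", "boy", "boys", "boyss")

# ? applies only to the s: "boy" with an optional "s"
optional_s <- str_extract(x, pattern = "boys?")

# Parentheses make ? apply to "ys" as a unit
optional_ys <- str_detect(x, pattern = "^bo(ys)?$")

goals <- c("gal", "goal", "gooal")

# + needs at least one o; * also accepts zero
one_or_more  <- str_detect(goals, pattern = "go+al")
zero_or_more <- str_detect(goals, pattern = "go*al")

optional_s
optional_ys
one_or_more
zero_or_more
```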
| means “or” in regexp-land
[1] │ <a>pple
[2] │ <a>pricot
[3] │ <a>vocado
[28] │ <e>ggplant
[29] │ <e>lderberry
[53] │ <o>live
[54] │ <o>range
[79] │ <u>gli fruit
Inside of [], ^ means “anything but”
# Match to any fruit name starting with anything but a, e, i, o, or u
str_view(fruit, pattern = "^[^aeiou]")
[4] │ <b>anana
[5] │ <b>ell pepper
[6] │ <b>ilberry
[7] │ <b>lackberry
[8] │ <b>lackcurrant
[9] │ <b>lood orange
[10] │ <b>lueberry
[11] │ <b>oysenberry
[12] │ <b>readfruit
[13] │ <c>anary melon
[14] │ <c>antaloupe
[15] │ <c>herimoya
[16] │ <c>herry
[17] │ <c>hili pepper
[18] │ <c>lementine
[19] │ <c>loudberry
[20] │ <c>oconut
[21] │ <c>ranberry
[22] │ <c>ucumber
[23] │ <c>urrant
... and 52 more
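A sketch comparing the two ways of writing “starts with a vowel”, plus the negated class. The alternation pattern is an assumption about what produced the first output above; the character class [aeiou] is a more compact way to write the same set of options:

```r
library(stringr)

# Alternation: a or e or i or o or u at the start of the string
vowel_start_alt <- str_subset(fruit, pattern = "^(a|e|i|o|u)")

# Character class: any one of the characters inside []
vowel_start_class <- str_subset(fruit, pattern = "^[aeiou]")

# Negated class: any one character except those inside [^ ]
consonant_start <- str_subset(fruit, pattern = "^[^aeiou]")

vowel_start_alt
length(consonant_start)
```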
We can also match ranges of characters: [a-z], for example
Exercise 6: Find all words in words that contain any two vowels next to each other
Exercise 7: In the numbers vector defined above, elements 1, 3, and 5 are valid 10-digit phone numbers. Write a regular expression that will match to only these strings.
Exercise 8: Describe in words what each of the following regular expressions matches:
a. ^.*$
b. "\\{.+\\}"
c. "\\\\{4}"
# exercise 6
str_view(words, pattern = "[aeiou]{2}")
# exercise 7
str_view(numbers, pattern = "^[0-9]{3}.[0-9]{3}.[0-9]{4}$")
# exercise 8
# part a
# start of string, zero or more characters, end of string
# part b
# the literal { character followed by any character one or more times followed by the literal } character
# part c
# the literal \ character four times in a row
The {stringr} cheatsheet
If you’re anything like me, you’re going to have a hard time remembering all of this regexp syntax
Luckily, there is an incredibly handy cheatsheet that you can reference. The second page contains all the useful regexp info.
I’ve been casting spells (so to speak) for a while now and still need to consult the mystic tome (I’m enjoying this metaphor) quite frequently.
The following exercises relate to a new regular expression concept called “look-arounds”. Use the {stringr} cheatsheet (the material related to look-arounds is near the bottom of the second page) to complete these exercises. These exercises use the words vector.
Exercise 9: Find all characters that are preceded by a vowel using a look-around. str_view_all() will show multiple matches within each string.
Exercise 10: Find all c, s, t and w characters not followed by an h using a look-around. str_view_all() will show multiple matches within each string.
[1] │ a
[2] │ a<b>le
[3] │ a<b>o<u><t>
[4] │ a<b>so<l>u<t>e
[5] │ a<c>ce<p>t
[6] │ a<c>co<u><n>t
[7] │ a<c>hi<e><v>e
[8] │ a<c>ro<s>s
[9] │ a<c>t
[10] │ a<c>ti<v>e
[11] │ a<c>tu<a><l>
[12] │ a<d>d
[13] │ a<d>dre<s>s
[14] │ a<d>mi<t>
[15] │ a<d>ve<r>ti<s>e
[16] │ a<f>fe<c>t
[17] │ a<f>fo<r>d
[18] │ a<f>te<r>
[19] │ a<f>te<r>no<o><n>
[20] │ a<g>a<i><n>
... and 960 more
[1] │ a
[2] │ able
[3] │ abou<t>
[4] │ ab<s>olu<t>e
[5] │ a<c><c>ep<t>
[6] │ a<c><c>oun<t>
[7] │ achieve
[8] │ a<c>ro<s><s>
[9] │ a<c><t>
[10] │ a<c><t>ive
[11] │ a<c><t>ual
[12] │ add
[13] │ addre<s><s>
[14] │ admi<t>
[15] │ adver<t>i<s>e
[16] │ affe<c><t>
[17] │ afford
[18] │ af<t>er
[19] │ af<t>ernoon
[20] │ again
... and 960 more
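One possible way to write look-arounds that reproduce the two outputs above, sketched on three of the words. The look-behind pattern is inferred from the first output rather than taken from the lecture, and str_extract_all() is used in place of str_view_all() so the matches come back as plain character vectors:

```r
library(stringr)

sample_words <- c("about", "accept", "achieve")

# Look-behind: any character that is preceded by a vowel
after_vowel <- str_extract_all(sample_words, pattern = "(?<=[aeiou]).")

# Negative look-ahead: c, s, t, or w not followed by an h
not_before_h <- str_extract_all(sample_words, pattern = "[cstw](?!h)")

after_vowel
not_before_h
```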
Regular expressions can help us transform datasets when combined with key {stringr} functions.
Filtering with str_detect()
Cleaning character columns with str_replace() and str_remove()
Creating new columns with str_extract()
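Before turning to real data, here is a minimal sketch of all three uses on a tiny made-up tibble. The toy data and the new column names are hypothetical, and {dplyr} is assumed to be installed:

```r
library(dplyr)
library(stringr)

# Hypothetical toy data
toy <- tibble(
  name       = c("R2-D2", "C-3PO", "Luke Skywalker"),
  skin_color = c("white, blue", "gold", "fair")
)

cleaned <- toy |>
  # Filtering: keep names that contain a hyphen
  filter(str_detect(name, pattern = "-")) |>
  mutate(
    # Cleaning: replace the comma separator, then strip the hyphen
    skin_color = str_replace(skin_color, pattern = ", ", replacement = "/"),
    name_plain = str_remove(name, pattern = "-"),
    # Creating a new column: the first digit in each name
    first_digit = str_extract(name, pattern = "[0-9]")
  )

cleaned
```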
# A tibble: 87 × 14
   name        height  mass hair_color skin_color eye_color birth_year sex
   <chr>        <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
 1 Luke Skyw…     172    77 blond      fair       blue            19   male
 2 C-3PO          167    75 <NA>       gold       yellow         112   none
 3 R2-D2           96    32 <NA>       white, bl… red             33   none
 4 Darth Vad…     202   136 none       white      yellow          41.9 male
 5 Leia Orga…     150    49 brown      light      brown           19   fema…
 6 Owen Lars      178   120 brown, gr… light      blue            52   male
 7 Beru Whit…     165    75 brown      light      blue            47   fema…
 8 R5-D4           97    32 <NA>       white, red red             NA   none
 9 Biggs Dar…     183    84 black      light      brown           24   male
10 Obi-Wan K…     182    77 auburn, w… fair       blue-gray       57   male
# ℹ 77 more rows
# ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
#   films <list>, vehicles <list>, starships <list>
Filtering
# filtering for all characters with droid names
starwars |>
filter(str_detect(name, pattern = "^[A-Z0-9]+-[A-Z0-9]+$"))
# A tibble: 5 × 14
  name   height  mass hair_color skin_color  eye_color birth_year sex
  <chr>   <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr>
1 C-3PO     167    75 <NA>       gold        yellow           112 none
2 R2-D2      96    32 <NA>       white, blue red               33 none
3 R5-D4      97    32 <NA>       white, red  red               NA none
4 IG-88     200   140 none       metal       red               15 none
5 R4-P17     96    NA none       silver, red red, blue         NA none
# ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
#   films <list>, vehicles <list>, starships <list>
Mutating
# creating a new column that holds each film's acronym
starwars |>
  unnest_longer(col = films) |>
  # str_extract_all() returns a list of character vectors of extracted characters
  mutate(film_acronym = str_extract_all(films,
                                        pattern = "\\b[a-zA-Z]{1}"),
         # str_flatten() combines each vector into a single string
         film_acronym = map_chr(.x = film_acronym,
                                .f = \(x) str_flatten(x))) |>
  select(name, species, films, film_acronym)
# A tibble: 173 × 4
   name           species films                   film_acronym
   <chr>          <chr>   <chr>                   <chr>
 1 Luke Skywalker Human   A New Hope              ANH
 2 Luke Skywalker Human   The Empire Strikes Back TESB
 3 Luke Skywalker Human   Return of the Jedi      RotJ
 4 Luke Skywalker Human   Revenge of the Sith     RotS
 5 Luke Skywalker Human   The Force Awakens       TFA
 6 C-3PO          Droid   A New Hope              ANH
 7 C-3PO          Droid   The Empire Strikes Back TESB
 8 C-3PO          Droid   Return of the Jedi      RotJ
 9 C-3PO          Droid   The Phantom Menace      TPM
10 C-3PO          Droid   Attack of the Clones    AotC
# ℹ 163 more rows
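To see what the acronym step does on its own, here is a sketch on two film titles outside of any data frame. Base R's vapply() stands in for map_chr() so that only {stringr} is needed:

```r
library(stringr)

titles <- c("A New Hope", "Return of the Jedi")

# \\b marks a word boundary, so this extracts the first letter of each word.
# The result is a list with one character vector per title.
first_letters <- str_extract_all(titles, pattern = "\\b[a-zA-Z]")

# Collapse each vector of letters into a single string
acronyms <- vapply(first_letters, str_flatten, character(1))

first_letters
acronyms
```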
DSC 210 Data Wrangling