Tidyverse Wrangling with Strings

strings, regex, and character data w/ base & stringr

2026-02-10

Strings

Character data

Character data: R datatype for text
Character: Smallest unit of text
String: Sequence of zero or more characters
Character vector: Collection of one or more strings
- A single string is a character vector of length 1

Create strings and character vectors

Use single ' or double " quotes to define strings:

favorite_class <- "d2m-R"

typeof(favorite_class) # data type

[1] "character"

class(favorite_class) # object class

[1] "character"

length(favorite_class) # no. elements

[1] 1

str_length(favorite_class) # no. characters

[1] 5

Use the combine function c() to create vectors of strings:

favorite_class_units <- c("d", "2", "m", "-", "R")

typeof(favorite_class_units) # data type

[1] "character"

class(favorite_class_units) # object class

[1] "character"

length(favorite_class_units) # no. elements

[1] 5

str_length(favorite_class_units) # no. char.

[1] 1 1 1 1 1

Define Examples

Define strings and vectors to use as examples in later slides:

# Strings
sentence <- 'The QUICK brown fox jumps over the LAZY dog.'
favorite_class <- "d2m-R"
messy_string <- "   Hello,   World!   "

# Character vectors
favorite_class_units <- c("d", "2", "m", "-", "R")
fruit <- c("apple", "banana", "cherry", "date", "elderberry")
greeting <- c("hello", "world")
farewell <- c("goodbye", "everyone")
abed_ratings <- c("cool", "cool cool cool", "cool cool", "cool", "sarcastic: ok", "evil: hot")

Base R vs tidyverse

Base R

Inconsistent naming
- grep, grepl, sub, gsub, regexpr, gregexpr
Inconsistent argument order
- Commonly (pattern, string)
Vectorized over the string only
- Pattern must be a single value; list results often need sapply() or loops

stringr

Consistent str_* pattern
- str_detect, str_replace, str_replace_all
Consistent order
- Always (string, pattern)
Vectorized by default
- Works with strings or vectors of strings

Base R vs tidyverse approach: Like most things in the tidyverse, stringr improves on base R by providing a consistent naming scheme and argument order. Contrast base R’s grep, sub, and gsub with stringr’s str_* functions, which always take the string first and the pattern second. Both are vectorized over the input strings, but stringr is also vectorized over the pattern.

What does it mean to be “vectorized by default”?

It means that stringr functions are designed to work on whole vectors of strings without sapply() or loops. You can pass a character vector to a stringr function, and it will operate on each element automatically, returning a vector of results. Most base R string functions (grepl(), sub(), gsub()) are also vectorized over the strings they search, but they accept only a single pattern, and functions such as gregexpr() and regmatches() return lists that often need lengths() or sapply() to summarize.
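A minimal sketch of what vectorization looks like in practice, assuming stringr is installed. The vector `words` and the patterns are made up for illustration; the second call shows that stringr also accepts one pattern per string.

```r
library(stringr)

words <- c("apple", "banana", "cherry")

# One pattern, recycled across every string in the vector
one_pattern <- str_detect(words, "an")
one_pattern
# [1] FALSE  TRUE FALSE

# One pattern per string: pattern i is matched against string i
paired_patterns <- str_detect(words, c("^a", "^c", "^c"))
paired_patterns
# [1]  TRUE FALSE  TRUE
```

No loop or sapply() call is needed in either case; the result has one element per input string.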

`stringr`: Simplify working with strings

Combine

Join strings

Interpolate

Embed variables in strings

Detect

Find patterns

Extract

Pull out text

Replace

Modify text

Transform

Change format

Create and Combine Strings

Combine: Concatenate

Concatenate: Join multiple strings together into one string

Take strings as separate arguments; return single string
Base:
- paste(): space separator by default
- paste0(): no separator
stringr:
- str_c(): no separator, NAs propagate

Compare combine vs concatenate

Combine c() vs. concatenate paste() / str_c():

Base:

# combine strings into vector
# not concatenated
c("hello", "world")

[1] "hello" "world"

Base:

# concatenate strings into a string
paste("hello", "world")

[1] "hello world"

# vector treated as 1 argument
# not concatenated
paste(greeting)

[1] "hello" "world"

stringr:

# concatenate strings into 1 string 
# no separator by default
str_c("hello", "world")

[1] "helloworld"

# vector treated as 1 argument
# not concatenated
str_c(greeting)

[1] "hello" "world"

Compare separator behavior

Base:

# default space separator
paste("hello", "world")

[1] "hello world"

# default no separator
paste0("hello", "world")

[1] "helloworld"

stringr:

# default no separator
str_c("hello", "world")

[1] "helloworld"

# custom space separator
str_c("hello", "world", sep = " ")

[1] "hello world"

Compare NA handling

Base:

# NA treated as "NA" (string)
paste("hello", NA)

[1] "hello NA"

stringr:

# NA propagates to result
str_c("hello", NA)

[1] NA

Compare vectorized (same)

Both paste() and str_c() are vectorized (can operate on strings or vectors of strings).

greeting_vector <- c("hello", "goodbye")
audience_vector <- c("world", "everyone")

# vectorized with space separator
paste(greeting_vector, audience_vector)

[1] "hello world"      "goodbye everyone"

# vectorized with no separator
str_c(greeting_vector, audience_vector)

[1] "helloworld"      "goodbyeeveryone"

Combine: Collapse

Collapse: Join elements of a string vector into a single string
Take vector of strings as single argument; return single string
- String vector: c("apple", "banana", "cherry", "date", "elderberry")
- Multiple string arguments: "apple", "banana", "cherry", "date", "elderberry"

Combine: Collapse

Base: Concat with paste()’s sep arg, collapse with collapse arg:

# sep has no effect with a single vector argument
paste(fruit, sep = ", ")

[1] "apple"      "banana"     "cherry"     "date"       "elderberry"

# collapse with comma
paste(fruit, collapse = ", ")

[1] "apple, banana, cherry, date, elderberry"

# sep is unused here: fruit is the only argument,
# so there is nothing to separate before collapsing
paste(fruit, sep = "SEPARATOR!!!", collapse = ", ")

[1] "apple, banana, cherry, date, elderberry"

stringr: Collapse vectors with str_flatten’s collapse arg, no sep arg:

# default no separator
str_flatten(fruit)

[1] "applebananacherrydateelderberry"

# collapse with comma
str_flatten(fruit, collapse = ", ")

[1] "apple, banana, cherry, date, elderberry"

# collapse with space
str_flatten(fruit, collapse = " ")

[1] "apple banana cherry date elderberry"

Separators are for combining multiple strings, each provided as a separate argument. Collapse is for combining elements of a single vector into one string.

What’s the difference between sep and collapse?
- sep: Specifies the string placed between separate arguments when concatenating multiple strings. For example, str_c("hello", "world", sep = " ") will produce “hello world”.
- collapse: Specifies the string placed between the elements of a vector when collapsing it into a single string. For example, str_flatten(c("apple", "banana", "cherry"), collapse = ", ") will produce “apple, banana, cherry”.
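The two arguments can also be used together. The sketch below uses two made-up vectors, `first` and `second`, to show that sep is applied across arguments first and collapse is applied across the resulting elements second.

```r
library(stringr)

first  <- c("a", "b", "c")
second <- c("1", "2", "3")

# sep joins the arguments element by element: still a vector of length 3
pairs <- str_c(first, second, sep = "-")
pairs
# [1] "a-1" "b-2" "c-3"

# collapse then joins those elements into one string
one_string <- str_c(first, second, sep = "-", collapse = ", ")
one_string
# [1] "a-1, b-2, c-3"

# base paste() follows the same order of operations
base_string <- paste(first, second, sep = "-", collapse = ", ")
```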

Interpolate

Interpolate: Embed variable values or expressions directly within strings
Take string with placeholders/expressions; return single string with values filled in
- Base sprintf() requires substitutions for the placeholders as additional arguments

Base: Interpolate with placeholders using sprintf()

Use placeholders: %s (string), %d (integer), %f (float)

name <- "Natalie"
age <- 99

# Arguments will replace placeholders in order
sprintf("My name is %s and I am %d years old.", name, age)

[1] "My name is Natalie and I am 99 years old."
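The %f placeholder is listed above but not demonstrated. A small sketch with a hypothetical `score` variable shows its default precision and how to control it:

```r
name  <- "Natalie"
score <- 91.256  # hypothetical value for illustration

# %f prints six decimal places by default
default_float <- sprintf("%f", score)
default_float
# [1] "91.256000"

# %.1f rounds to one decimal place; %% prints a literal percent sign
report <- sprintf("%s scored %.1f%%.", name, score)
report
# [1] "Natalie scored 91.3%."
```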

stringr: Interpolate directly with str_glue()

Evaluate R expressions directly in strings using {}

name <- "Jenn Allen-Pho"
age <- 67

str_glue("My name is {name} and I am {age} years old.")

My name is Jenn Allen-Pho and I am 67 years old.

Interpolate: Calculations and functions

Include simple calculations:

x <- 10
str_glue("The value of x is {x} 
    and x squared is {x^2}.")

The value of x is 10 
and x squared is 100.

Include embedded R code:

a <- 5
b <- 3

str_glue("The sum of a and b is {a + b} 
    and their mean is {mean(c(a,b))}.")

The sum of a and b is 8 
and their mean is 4.

String Detection and Matching

Detect: Identify patterns

Does the pattern exist? : Return logical TRUE or FALSE

fruit

[1] "apple"      "banana"     "cherry"     "date"       "elderberry"

Base: Find pattern with grepl() (=grep logical)

grepl("a", "banana")

[1] TRUE

grepl("z", "banana")

[1] FALSE

grepl("err", fruit)

[1] FALSE FALSE  TRUE FALSE  TRUE

!grepl("err", fruit)

[1]  TRUE  TRUE FALSE  TRUE FALSE

stringr: Find pattern with str_detect()

str_detect("banana", "an")

[1] TRUE

str_detect("banana", "z")

[1] FALSE

str_detect(fruit, "err")

[1] FALSE FALSE  TRUE FALSE  TRUE

str_detect(fruit, "err", negate = TRUE)

[1]  TRUE  TRUE FALSE  TRUE FALSE

Tidy string detection

Use str_detect() in tidy pipelines:

# Using str_detect() in filter()
data.frame(fruit) |> 
    # keep fruits containing "a"
    filter(str_detect(fruit, "a"))

   fruit
1  apple
2 banana
3   date

Detection is case-sensitive by default, but can be made case-insensitive with regex() wrapper and ignore_case arg:

str_detect("Banana", "a")

[1] TRUE

str_detect("Banana", "A")

[1] FALSE

# Use regex to make case-insensitive
str_detect("Banana", 
    regex("A", ignore_case = TRUE))

[1] TRUE

Detect: Locate patterns (first)

Where does the pattern appear first? : Return position of first match

Base: Find first match with regexpr(), return an integer with attributes

# regexpr() returns start position of first match
base_first_match <- regexpr("an", "banana") 
base_first_match

[1] 2
attr(,"match.length")
[1] 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

class(base_first_match)

[1] "integer"

stringr: Find first match with str_locate(), return a matrix with start/end columns

# str_locate() returns matrix with start/end of first match
stringr_first_match <- str_locate("banana", "an") 
stringr_first_match

     start end
[1,]     2   3

class(stringr_first_match)

[1] "matrix" "array"

Detect: Locate patterns (all)

Where does the pattern appear anywhere? : Return integer position(s) of all matches

Base: Find all matches with gregexpr(), return list of integers

# gregexpr() returns positions of all matches 
# as list of integer vectors with attributes
base_all_match <- gregexpr("an", "banana") 
base_all_match

[[1]]
[1] 2 4
attr(,"match.length")
[1] 2 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

class(base_all_match)

[1] "list"

stringr: Find all matches with str_locate_all(), return list of matrices

# str_locate_all() returns start and end positions
# of all matches as list of matrices
stringr_all_match <- str_locate_all("banana", "an")
stringr_all_match

[[1]]
     start end
[1,]     2   3
[2,]     4   5

class(stringr_all_match)

[1] "list"

Detect: Count patterns

How many times does the pattern appear? : Return integer count of matches

Base: Count matches with gregexpr() + lengths()

# Use gregexpr() to find all matches, 
# then lengths() to count
lengths(
    gregexpr("an", "banana")
    )

[1] 2
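One caveat with the base approach: when there is no match, gregexpr() returns the single value -1, so lengths() reports 1 rather than 0. The sketch below shows the problem and one possible workaround (counting only positive positions); str_count() is included for comparison.

```r
library(stringr)

no_match <- gregexpr("z", "banana")

# lengths() counts the -1 "no match" marker as one element
naive_count <- lengths(no_match)
naive_count
# [1] 1

# Counting only positive start positions gives the true number of matches
safe_count <- sapply(no_match, function(m) sum(m > 0))
safe_count
# [1] 0

# str_count() handles the no-match case directly
stringr_count <- str_count("banana", "z")
stringr_count
# [1] 0
```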

stringr: Count matches with str_count()

# counts occurrences in single string
str_count("banana", "an")

[1] 2

str_count() is vectorized:

# Count occurrences of "cool" in 
# each string of abed_ratings vector
str_count(abed_ratings, "cool")

[1] 1 3 2 1 0 0

Example: Count words with regex

sentence

[1] "The QUICK brown fox jumps over the LAZY dog."

# Count words using word boundary pattern
str_count(sentence, "\\b\\w+\\b")

[1] 9

String Manipulation

Extract: Subset by pattern

Extract strings that match pattern: Return strings with matched pattern

Base: Return matching strings with grep() and value = TRUE

grep("a", fruit, value = TRUE)

[1] "apple"  "banana" "date"

stringr: Return matching strings with str_subset()

str_subset(fruit, "a")

[1] "apple"  "banana" "date"

str_subset() is a shortcut for str_detect() + subsetting

# detect matches returns a logical vector
# then subset with [] indexing on the TRUE values
fruit[str_detect(fruit, "a")]

[1] "apple"  "banana" "date"

Extract: Subset by position

Extract by position: Return substring based on character positions

sentence

[1] "The QUICK brown fox jumps over the LAZY dog."

Base: Use substr() to extract substring by character positions

substr(x, start, stop); positive indices only

# extract characters 1-9
substr(sentence, 1, 9)

[1] "The QUICK"

stringr: Use str_sub() to extract substring by character positions, supports negative indexing

str_sub(x, start, end); accepts negative indices

# extract characters 1-9
str_sub(sentence, 1, 9)

[1] "The QUICK"

# extract last 4 characters
str_sub(sentence, -4, -1)

[1] "dog."
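For comparison, here is a sketch of getting the last 4 characters in base R, where the positions have to be computed from the string length with nchar(). The variable `n` is introduced only for this example.

```r
library(stringr)

sentence <- "The QUICK brown fox jumps over the LAZY dog."

# Base: compute positions from the total number of characters
n <- nchar(sentence)
base_last4 <- substr(sentence, n - 3, n)
base_last4
# [1] "dog."

# stringr: count from the end; end defaults to -1 (the last character)
stringr_last4 <- str_sub(sentence, -4)
stringr_last4
# [1] "dog."
```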

Example use case: Extract specific parts of a string and interpolate with str_glue()

pos_start <- 1
neg_end <- -1

str_glue(
    "Abridged sentence:
    {str_sub(sentence, pos_start, 9)} {str_sub(sentence, -4, neg_end)}"
    )

Abridged sentence:
The QUICK dog.

substr() is an example of where base string functions can be inconsistent. Since there is no pattern matching in this kind of extraction, there is no pattern to come first, so the string to search comes first. This is perfectly reasonable, but it can be confusing when you are used to the pattern-first convention of other string functions. In contrast, str_sub() follows the consistent structure of string as first argument. This consistency in argument order helps reduce confusion and makes it easier to remember how to use different functions within the stringr package. It also makes all stringr functions play nice in pipelines, where the result of one line is implicitly the first argument of the next.

Coding challenge:

Use regex, str_glue, and str_sub to detect the first occurrence of an animal and of an adjective in the sentence, extract them, and create a new string with them. For example, if the sentence is “The QUICK brown fox jumps over the LAZY dog.”, you could extract “fox” (animal) and “QUICK” (adjective) and create a new string like “Look at the QUICK fox!”

R doesn’t know what “fox” or “LAZY” mean, so you’ll need to define what counts as an animal and an adjective for this challenge. The simplest way to do this is just to name the words directly in a vector or within the regex pattern, but you can try for a more adaptable or creative definition if you want to make it more interesting!
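If you want a starting point, here is one possible sketch for the basic version only. It assumes the simplest definition: the animals and adjectives are named directly in the regex patterns. The names `animals`, `adjectives`, `animal_pos`, and `adj_pos` are placeholders, not part of the challenge.

```r
library(stringr)

sentence <- "The QUICK brown fox jumps over the LAZY dog."

# Assumption: define each category as alternatives in a regex
animals    <- "fox|dog"
adjectives <- "QUICK|brown|LAZY"

# str_locate() returns the start/end of the first match
animal_pos <- str_locate(sentence, animals)
adj_pos    <- str_locate(sentence, adjectives)

# str_sub() extracts the text at those positions
animal    <- str_sub(sentence, animal_pos[1, "start"], animal_pos[1, "end"])
adjective <- str_sub(sentence, adj_pos[1, "start"], adj_pos[1, "end"])

result <- str_glue("Look at the {adjective} {animal}!")
result
# Look at the QUICK fox!
```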

Challengier:

Create a function that lets your user decide whether the resulting sentence should use the first or second occurrence of the animal and adjective. For example, if the user chooses “first”, the function would extract “QUICK” and “fox” to create “Look at the QUICK fox!”, but if they choose “second”, it would extract “LAZY” and “dog” to create “Look at the LAZY dog!”.

Yet even more challengier:

Allow your user to specify first animal and second adjective (or vice versa), or to specify a different sentence to analyze. Make sure your function can handle cases where the specified occurrence doesn’t exist (e.g., if the user asks for the second animal but there is only one animal in the sentence) and provide informative error messages.

Extract: Extract by pattern

Extract pattern from strings : Return matched pattern as string

Base: Find positions with regexpr() (first match) or gregexpr() (all matches), then extract them with regmatches()

# extract first "o"
regmatches(sentence, regexpr("o", sentence))

[1] "o"

# extract all "o"
regmatches(sentence, gregexpr("o", sentence))

[[1]]
[1] "o" "o" "o" "o"

stringr: Extract pattern with str_extract() (first match) or str_extract_all() (all matches)

# extract first "o"
str_extract(sentence, "o")

[1] "o"

# extract all "o"
str_extract_all(sentence, "o")

[[1]]
[1] "o" "o" "o" "o"

Extract: Extract by pattern with regex

Extract regex pattern from strings : Return matched non-literal pattern as string

Base: Find positions with regexpr() (first match) or gregexpr() (all matches), then extract them with regmatches()

# extract first word
regmatches(sentence, regexpr("\\b\\w+\\b", sentence))

[1] "The"

# extract all words
regmatches(sentence, gregexpr("\\b\\w+\\b", sentence))

[[1]]
[1] "The"   "QUICK" "brown" "fox"   "jumps" "over"  "the"   "LAZY"  "dog"

# extract all-uppercase words
regmatches(sentence, gregexpr("\\b[A-Z]+\\b", sentence))

[[1]]
[1] "QUICK" "LAZY"

stringr: Extract pattern with str_extract() (first match) or str_extract_all() (all matches)

# extract first word
str_extract(sentence, "\\b\\w+\\b")

[1] "The"

# extract all words
str_extract_all(sentence, "\\b\\w+\\b")

[[1]]
[1] "The"   "QUICK" "brown" "fox"   "jumps" "over"  "the"   "LAZY"  "dog"

# extract all-uppercase words
str_extract_all(sentence, "\\b[A-Z]+\\b")

[[1]]
[1] "QUICK" "LAZY"

Replace: Exchange 2 patterns

Replace patterns in strings : Return modified string with pattern replaced by new text

Base: sub() replaces first match, gsub() replaces all matches

# replace first "a" with "X"
sub("a", "X", "banana")

[1] "bXnana"

# replace all "a" with "Y"
gsub("a", "Y", "banana")

[1] "bYnYnY"

# replace "err" with "Z" in all strings containing "err"
gsub("err", "Z", fruit)

[1] "apple"    "banana"   "chZy"     "date"     "elderbZy"

stringr: Use str_replace() to replace first match, str_replace_all() to replace all matches

# replace first "a" with "X"
str_replace("banana", "a", "X")

[1] "bXnana"

# replace all "a" with "Y"
str_replace_all("banana", "a", "Y")

[1] "bYnYnY"

# replace "err" with "Z" in all strings containing "err"
str_replace_all(fruit, "err", "Z")

[1] "apple"    "banana"   "chZy"     "date"     "elderbZy"

Replace: Remove patterns

Remove patterns from strings : Return modified string with pattern removed

Base: Remove pattern with sub() or gsub() and replacing with empty string ""

# remove first "a" from "banana"
sub("a", "", "banana")

[1] "bnana"

# remove "err" from all strings
gsub("err", "", fruit)

[1] "apple"   "banana"  "chy"     "date"    "elderby"

stringr: Remove pattern with str_replace() or str_replace_all()

# remove first "a" from "banana"
str_replace("banana", "a", "")

[1] "bnana"

# remove "err" from all strings
str_replace_all(fruit, "err", "")

[1] "apple"   "banana"  "chy"     "date"    "elderby"

stringr: Remove pattern with str_remove() or str_remove_all()

# remove first "a" from "banana"
str_remove("banana", "a")

[1] "bnana"

# remove "err" from all strings
str_remove_all(fruit, "err")

[1] "apple"   "banana"  "chy"     "date"    "elderby"

Replace: regex patterns

Use regex patterns with both base and stringr replacement functions to do more complex replacements. For example, you can replace every word or every vowel in a string:

Base

# replace all words with "WORD"
gsub("\\b\\w+\\b", "WORD", sentence)

[1] "WORD WORD WORD WORD WORD WORD WORD WORD WORD."

# Replace all vowels with "*" and spaces with "_"
gsub(" ", "_", gsub("[aeiouAEIOU]", "*", sentence))

[1] "Th*_Q**CK_br*wn_f*x_j*mps_*v*r_th*_L*ZY_d*g."

stringr

# replace all words with "WORD"
str_replace_all(sentence, "\\b\\w+\\b", "WORD")

[1] "WORD WORD WORD WORD WORD WORD WORD WORD WORD."

# Replace all vowels with "*" and spaces with "_"
str_replace_all(sentence, "[aeiouAEIOU]", "*") |> 
  str_replace_all(" ", "_")

[1] "Th*_Q**CK_br*wn_f*x_j*mps_*v*r_th*_L*ZY_d*g."

Transform: Trim whitespace

Clean up messy text : Return modified string with leading/trailing whitespace removed

messy_string <- "   Hello,   World!   "

Base: Remove leading/trailing whitespace with trimws()

# remove leading & trailing whitespace
trimws(messy_string)

[1] "Hello,   World!"

# remove leading whitespace only
trimws(messy_string, which = "left")

[1] "Hello,   World!   "

stringr: Remove leading/trailing whitespace with str_trim()

# remove leading & trailing whitespace
str_trim(messy_string)

[1] "Hello,   World!"

# remove trailing whitespace only
str_trim(messy_string, side = "right")

[1] "   Hello,   World!"

stringr: Also collapse internal whitespace with str_squish()

# removing both leading and trailing
# AND ALSO collapse any internal 
# whitespace to a single space
str_squish(messy_string)

[1] "Hello, World!"

Transform: Pad strings

Pad strings to fixed width : Return modified string with padding added to reach specified width

Base: Pad strings with sprintf() and width specifier

# pad "42" to width of 5 with spaces
sprintf("%5s", "42")

[1] "   42"

# pad "42" to width of 5 with zeros
# (the 0 flag with %s is platform-dependent; some systems pad with spaces instead)
sprintf("%05s", "42")

[1] "00042"

stringr: Pad strings with str_pad() and width, side, pad args

# pad "42" to width of 5 with spaces on the left
str_pad("42", width = 5, side = "left")

[1] "   42"

# pad "42" to width of 5 with zeros on the right
str_pad("42", width = 5, side = "right", pad = "0")

[1] "42000"
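A few more padding cases, sketched with the same "42" input: zero-padding on the left with str_pad(), padding both sides, and zero-padding a number (rather than a string) with sprintf(), which behaves the same on every platform.

```r
library(stringr)

# zeros on the left (side = "left" is the default)
left_zero <- str_pad("42", width = 5, pad = "0")
left_zero
# [1] "00042"

# pad both sides to center the text
both_sides <- str_pad("42", width = 6, side = "both", pad = "*")
both_sides
# [1] "**42**"

# base: zero-pad an integer with %05d
number_zero <- sprintf("%05d", 42L)
number_zero
# [1] "00042"

# strings already longer than width are returned unchanged
too_long <- str_pad("123456", width = 5)
too_long
# [1] "123456"
```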

Transform: Convert case

Change letter case : Return modified string with case converted

sentence

[1] "The QUICK brown fox jumps over the LAZY dog."

Base: Convert to all upper/lower case with toupper() & tolower()

toupper(sentence)

[1] "THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG."

tolower(sentence)

[1] "the quick brown fox jumps over the lazy dog."

stringr: Convert to all upper/lower case with str_to_upper() & str_to_lower()

str_to_upper(sentence)

[1] "THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG."

str_to_lower(sentence)

[1] "the quick brown fox jumps over the lazy dog."

stringr: Convert to title/sentence case with str_to_title() & str_to_sentence()

# Capitalize Each Word
str_to_title(sentence)

[1] "The Quick Brown Fox Jumps Over The Lazy Dog."

# Capitalize first word.
str_to_sentence(sentence)

[1] "The quick brown fox jumps over the lazy dog."

Regular Expressions

Regex overview

Literal characters: Match exactly as written

Metacharacters have special meaning, like defining what kinds of characters to recognize, how many times to match, or where in the string to look. Common metacharacters:

Metacharacter	Meaning
`.`	Any single character except newline
`*`	0 or more of the preceding element
`+`	1 or more of the preceding element
`?`	0 or 1 of the preceding element
`[]`	Character class (matches any one character inside)
`{}`	Quantifier (specifies number of occurrences)
`^`	Start of string
`$`	End of string
`\`	Escape character (treat next character literally)
`|`	Alternation (matches the pattern before or after)

Escaping allows you to match metacharacters as literal characters by preceding them with the escape metacharacter. In an R string, the backslash itself must be doubled.
- Match literal period: \. → "\\." in R
- Match literal asterisk: \* → "\\*" in R

Match with metacharacter:

str_detect("filetxt", ".")

[1] TRUE

str_detect("", ".")

[1] FALSE

Literal match with escaped metacharacter:

str_detect("file.txt", "\\.")

[1] TRUE

str_detect("filetxt", "\\.")

[1] FALSE
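The difference matters most in replacement, where an unescaped metacharacter can change far more than intended. A short sketch using the same "file.txt" string; the square-bracket form is an alternative way to make the period literal.

```r
library(stringr)

# Unescaped: "." matches every character, so everything is replaced
unescaped <- str_replace_all("file.txt", ".", "_")
unescaped
# [1] "________"

# Escaped: only the literal period is replaced
escaped <- str_replace_all("file.txt", "\\.", "_")
escaped
# [1] "file_txt"

# A period inside a character class is also treated literally
bracketed <- str_replace_all("file.txt", "[.]", "_")
bracketed
# [1] "file_txt"
```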

Character classes

Character classes match any one character from a set. Common character classes:

Class	Meaning
`[abc]`	a, b, or c
`[^abc]`	NOT a, b, or c
`[a-z]`	any lowercase letter
`\d`	digit
`\w`	word character (letter, digit, or underscore)
`\s`	whitespace character (space, tab, newline)

str_detect("apple", "[aeiou]")

[1] TRUE

str_detect("apple", "[^aeiou]")

[1] TRUE

str_detect("apple", "\\d")

[1] FALSE

str_detect("apple", "\\w")

[1] TRUE

str_detect("apple", "\\s")

[1] FALSE

Quantifiers

Quantifiers specify how many times the preceding element should be matched:

Quantifier	Meaning
`*`	0 or more of the preceding element
`+`	1 or more of the preceding element
`?`	0 or 1 of the preceding element
`{n}`	exactly n of the preceding element
`{n,}`	n or more of the preceding element
`{n,m}`	between n and m of the preceding element

str_detect("aaab", "a*")

[1] TRUE

str_detect("aaab", "a+")

[1] TRUE

str_detect("aaab", "a?")

[1] TRUE

str_detect("aaab", "a{3}")

[1] TRUE

str_detect("aaab", "a{2,}")

[1] TRUE

str_detect("aaab", "a{2,3}")

[1] TRUE
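Because every call above returns TRUE, the quantifiers are easier to tell apart by checking a string with no "a" at all and by extracting what each pattern actually matches. A small sketch:

```r
library(stringr)

# "*" allows zero occurrences, so it matches even when there is no "a"
star_on_b <- str_detect("b", "a*")
star_on_b
# [1] TRUE

# "+" requires at least one occurrence
plus_on_b <- str_detect("b", "a+")
plus_on_b
# [1] FALSE

# str_extract() shows how much text each quantifier consumes
optional_a <- str_extract("aaab", "a?")     # at most one "a"
exactly_2  <- str_extract("aaab", "a{2}")   # exactly two
at_least_2 <- str_extract("aaab", "a{2,}")  # two or more, as many as possible
c(optional_a, exactly_2, at_least_2)
# [1] "a"   "aa"  "aaa"
```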

Example: Quantifiers in stringr functions

Use quantifiers in stringr functions to specify how many occurrences to replace, locate, count, etc.:

# replace 2 or 3 a's with "X"
str_replace_all("aaab", "a{2,3}", "X")

[1] "Xb"

# locates positions of 2 or 3 a's 
str_locate_all("aaab", "a{2,3}")

[[1]]
     start end
[1,]     1   3

# counts occurrences of 2 or 3 a's
str_count("aaab", "a{2,3}")

[1] 1

Greedy vs lazy matching

Greedy: match as much as possible with .*
- Stop at the end of the string or the last possible match
Lazy: match as little as possible with .*?
- Stop at the first possible match

double_dog <- "The quick brown fox jumps over the lazy dog. The dog was displeased."

# Greedy match (matches everything between first "The" and last "dog")
str_extract(double_dog, "The.*dog")

[1] "The quick brown fox jumps over the lazy dog. The dog"

# Lazy match (matches the shortest string between first "The" and first "dog")
str_extract(double_dog, "The.*?dog")

[1] "The quick brown fox jumps over the lazy dog"
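A common place where this difference shows up is matching delimited text such as tags. The sketch below uses a made-up `html` string to compare the two behaviors.

```r
library(stringr)

html <- "<b>bold</b> and <i>italic</i>"

# Greedy: runs from the first "<" to the last ">"
greedy <- str_extract(html, "<.+>")
greedy
# [1] "<b>bold</b> and <i>italic</i>"

# Lazy: stops at the first ">" it can
lazy <- str_extract(html, "<.+?>")
lazy
# [1] "<b>"

# Lazy matching with str_extract_all() picks out each tag separately
all_tags <- str_extract_all(html, "<.+?>")[[1]]
all_tags
# [1] "<b>"  "</b>" "<i>"  "</i>"
```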

Anchors & boundaries

Anchors specify where in the string to look for a match:

Anchor	Meaning
`^`	Start of string
`$`	End of string
`\b`	Word boundary (position between a word character and a non-word character)
`\B`	Non-word boundary (position between two word characters or two non-word characters)

Note that the caret `^` has two uses: as an anchor for the start of the string, and as a negation symbol when used inside square brackets for character classes.

Match partial string with ^ or $:

str_detect("banana", "^ba")

[1] TRUE

str_detect("banana", "na$")

[1] TRUE

Match entire string with ^ and $:

str_detect("banana", "^banana$")

[1] TRUE

str_detect("banana", "^ban$")

[1] FALSE

Example: Whole-Word Matching

Use word boundaries for whole word matching:

sea_cat <- "The cat is on the catamaran. She's a weird cat."

str_count(sea_cat, " cat ")

[1] 1

str_count(sea_cat, "\\bcat\\b")

[1] 2

str_count(sea_cat, "cat")

[1] 3

Example: Pattern Validation

Validation pattern example (e.g., email, phone)

# Simple email validation pattern: one or more word characters, @, one or more word characters, ., two or more word characters
email_pattern <- "^\\w+@\\w+\\.\\w{2,}$"

str_detect("name@school.edu", email_pattern) # returns TRUE for valid email format

[1] TRUE

str_detect("invalid email@spammer,com", email_pattern) # returns FALSE for invalid email format

[1] FALSE
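The simple pattern is deliberately strict: \w does not match a period, so addresses with dots in the name or with subdomains are rejected. The sketch below shows this on a few made-up addresses and tries a looser pattern. `looser_pattern` is an illustrative assumption, not a complete email validator.

```r
library(stringr)

email_pattern <- "^\\w+@\\w+\\.\\w{2,}$"

emails <- c(
  "name@school.edu",
  "first.last@school.edu",
  "name@mail.school.edu"
)

# The simple pattern rejects dots before the @ and extra subdomains
strict <- str_detect(emails, email_pattern)
strict
# [1]  TRUE FALSE FALSE

# Hypothetical looser pattern: allow . + - in the name and optional subdomains
looser_pattern <- "^[\\w.+\\-]+@[\\w\\-]+(\\.[\\w\\-]+)*\\.[A-Za-z]{2,}$"

loose <- str_detect(emails, looser_pattern)
loose
# [1] TRUE TRUE TRUE

# Clearly malformed input is still rejected
still_invalid <- str_detect("invalid email@spammer,com", looser_pattern)
still_invalid
# [1] FALSE
```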

Capture groups & back-references

Use capture groups to extract specific parts of a matched pattern and back-reference them in the replacement or later in the regex.

(): Create capture group
\1, \2, etc.: Reference captured groups by number (written "\\1", "\\2" in R strings)

Use back-references to reorder captured groups:

# Capture the two animals (fox, dog) and swap their placement in the string
str_replace(sentence, "(fox)(.*)(dog)", "\\3\\2\\1")

[1] "The QUICK brown dog jumps over the LAZY fox."
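Back-references also work inside the pattern itself. As a sketch, the pattern "(.)\\1" captures any character and then requires the same character again, which finds doubled letters in the fruit vector:

```r
library(stringr)

fruit <- c("apple", "banana", "cherry", "date", "elderberry")

# (.) captures one character; \\1 requires that same character to repeat
has_double <- str_detect(fruit, "(.)\\1")
has_double
# [1]  TRUE FALSE  TRUE FALSE  TRUE

# Extract the first doubled letter in each string (NA when there is none)
doubles <- str_extract(fruit, "(.)\\1")
doubles
# [1] "pp" NA   "rr" NA   "rr"
```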

Example: Reorder US date format

Reorder US dates MM/DD/YYYY to YYYY-MM-DD using capture groups and back-references:

new_year <- "The date is 12/31/2025. Happy New Year!"
birthday <- "The date is 6/7/25. Happy birthday!"

# Capture month, day, year and reorder to YYYY-MM-DD
str_replace(new_year, "(\\d{2})/(\\d{2})/(\\d{4})", "\\3-\\1-\\2")

[1] "The date is 2025-12-31. Happy New Year!"

# account for the possibility of 1-digit month/day and 2-digit year
str_replace(birthday, "(\\d{1,2})/(\\d{1,2})/(\\d{2,4})", "\\3-\\1-\\2")

[1] "The date is 25-6-7. Happy birthday!"
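When you need the captured pieces as separate values rather than a rewritten string, str_match() returns them as columns of a matrix. A minimal sketch on the `new_year` string; the names `parts`, `month`, `day`, `year`, and `iso` are introduced only for this example.

```r
library(stringr)

new_year <- "The date is 12/31/2025. Happy New Year!"

# Column 1 is the full match; columns 2-4 are the capture groups
parts <- str_match(new_year, "(\\d{2})/(\\d{2})/(\\d{4})")
parts
#      [,1]         [,2] [,3] [,4]
# [1,] "12/31/2025" "12" "31" "2025"

month <- parts[1, 2]
day   <- parts[1, 3]
year  <- parts[1, 4]

# Reassemble the pieces in YYYY-MM-DD order
iso <- str_c(year, month, day, sep = "-")
iso
# [1] "2025-12-31"
```

Having the components as separate values makes it possible to transform each one (for example with str_pad()) before joining them again.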

Regex Challenge

Create a date-reordering function that accepts 1- or 2-digit months and days and 2- or 4-digit years as input but always delivers a consistent YYYY-MM-DD output. Use capture groups to extract the components and back-references to reorder them in the desired format. Test your function on various date strings to ensure it works correctly.

# Create function `reorder_date()`
reorder_date <- function(date_string) {
    
  # Define regex pattern to capture month, day, and year with flexible digit counts

  # Use str_replace to reorder the date components to YYYY-MM-DD

}

# Test with multiple string types
reorder_date("4/5/2023") # Expected: "2023-04-05"

Even challengier: Handle multiple date formats by adding arguments for the user to specify the input and/or output format.
