strings, regex, and character data w/ base & stringr
2026-02-10
Use single ' or double " quotes to define strings:
Use the combine function c() to create vectors of strings:
Define strings and vectors to use as examples in later slides:
# Strings
sentence <- 'The QUICK brown fox jumps over the LAZY dog.'
favorite_class <- "d2m-R"
messy_string <- " Hello, World! "
# Character vectors
favorite_class_units <- c("d", "2", "m", "-", "R")
fruit <- c("apple", "banana", "cherry", "date", "elderberry")
greeting <- c("hello", "world")
farewell <- c("goodbye", "everyone")
abed_ratings <- c("cool", "cool cool cool", "cool cool", "cool", "sarcastic: ok", "evil: hot")
Base R
Functions: grep, grepl, sub, gsub, regexpr, gregexpr
Argument order: (pattern, string)
sapply() or loops sometimes needed for vectors (e.g., multiple patterns)
stringr
str_* naming pattern
Functions: str_detect, str_replace, str_replace_all
Argument order: (string, pattern)
stringr: Simplify working with strings
Combine: Join strings
Interpolate: Embed variables in strings
Detect: Find patterns
Extract: Pull out text
Replace: Modify text
Transform: Change format
Concatenate: Join multiple strings together into one string
Base:
paste(): space separator by default
paste0(): no separator
stringr:
str_c(): no separator, NAs propagate
Combine c() vs. concatenate paste() / str_c():
Both paste() and str_c() are vectorized (can operate on strings or vectors of strings).
c("apple", "banana", "cherry", "date", "elderberry")
"apple", "banana", "cherry", "date", "elderberry"
Base: Concat with paste()’s sep arg, collapse with collapse arg:
[1] "apple" "banana" "cherry" "date" "elderberry"
[1] "apple, banana, cherry, date, elderberry"
[1] "apple, banana, cherry, date, elderberry"
stringr: Collapse vectors with str_flatten’s collapse arg, no sep arg:
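A minimal sketch of the difference between combining, concatenating, and collapsing, reusing the example vectors defined earlier. The result variable names (combined, pasted, and so on) are illustrative and not from the original slides.

```r
library(stringr)

greeting <- c("hello", "world")
farewell <- c("goodbye", "everyone")
fruit <- c("apple", "banana", "cherry", "date", "elderberry")

# c() builds a longer vector; paste()/str_c() join strings element-wise
combined <- c(greeting, farewell)                 # length-4 character vector
pasted   <- paste(greeting, farewell)             # space separator by default
pasted0  <- paste0(greeting, farewell)            # no separator
joined   <- str_c(greeting, farewell, sep = " ")  # str_c() needs sep for a space

# Collapse a whole vector into a single string
base_flat    <- paste(fruit, collapse = ", ")
stringr_flat <- str_flatten(fruit, collapse = ", ")

# NA handling differs: paste() converts NA to text, str_c() propagates NA
na_base    <- paste("a", NA)
na_stringr <- str_c("a", NA)
```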
Base: Interpolate with placeholders using sprintf()
sprintf() requires substitutions for the placeholders as additional arguments
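As a rough illustration of sprintf() interpolation, the sketch below fills placeholders from the example variables; the message text and result names are made up for this example.

```r
favorite_class <- "d2m-R"
fruit <- c("apple", "banana", "cherry", "date", "elderberry")

# One placeholder, one substitution
msg <- sprintf("My favorite class is %s", favorite_class)

# sprintf() is vectorized: one output string per element of fruit
# nchar() returns integers, which is what %d expects
lengths_msg <- sprintf("%s has %d characters", fruit, nchar(fruit))

# Control the number of decimal places for floats
pi_msg <- sprintf("pi is roughly %.2f", pi)
```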
Placeholders: %s (string), %d (integer), %f (float)
Does the pattern exist? : Return logical TRUE or FALSE
Base: Find pattern with grepl() (=grep logical)
Use str_detect() in tidy pipelines:
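A small sketch of pattern detection with both approaches. The pattern "an" is an arbitrary choice for illustration, and the native pipe |> assumes R 4.1 or later.

```r
library(stringr)

fruit <- c("apple", "banana", "cherry", "date", "elderberry")

# Base: pattern comes first, string second
has_an_base <- grepl("an", fruit)

# stringr: string comes first, so it slots into a pipe
has_an_str <- fruit |> str_detect("an")

# Logical vectors are handy for subsetting
an_fruit <- fruit[has_an_str]
```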
Where does the pattern appear first? : Return position of first match
Base: Find first match with regexpr(), return an integer with attributes
Where does the pattern appear anywhere? : Return integer position(s) of all matches
Base: Find all matches with gregexpr(), return list of integers
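The sketch below locates the letter "o" in the example sentence. The stringr calls str_locate() and str_locate_all() are not named on this slide; they are included here on the assumption that they are the intended stringr counterparts.

```r
library(stringr)

sentence <- 'The QUICK brown fox jumps over the LAZY dog.'

# Base: first match, returned as an integer with a match.length attribute
first_o <- regexpr("o", sentence)

# Base: all matches, returned as a list with one element per input string
all_o <- gregexpr("o", sentence)[[1]]

# stringr: matrices of start/end positions
first_o_str <- str_locate(sentence, "o")
all_o_str   <- str_locate_all(sentence, "o")[[1]]
```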
How many times does the pattern appear? : Return integer count of matches
Base: Count matches with gregexpr() + lengths()
str_count is vectorized:
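A hedged sketch of counting matches. One caveat worth showing: gregexpr() returns -1 when nothing matches, so lengths() alone reports 1 for a string with no match. Wrapping the result in regmatches() is one way around that.

```r
library(stringr)

sentence <- 'The QUICK brown fox jumps over the LAZY dog.'
fruit <- c("apple", "banana", "cherry", "date", "elderberry")

# Base: works when at least one match exists
n_o_base <- lengths(gregexpr("o", sentence))

# Base, safer for vectors: zero-match strings yield character(0)
n_a_base <- lengths(regmatches(fruit, gregexpr("a", fruit)))

# stringr: one count per element, zero when there is no match
n_a_str <- str_count(fruit, "a")
```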
Extract strings that match pattern: Return strings with matched pattern
Base: Return matching strings with grep() and value = TRUE
stringr: Return matching strings with str_subset()
str_subset() is a shortcut for str_detect() + subsetting
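To make the shortcut concrete, this sketch returns the fruits containing "err" three ways; the pattern is an arbitrary illustration.

```r
library(stringr)

fruit <- c("apple", "banana", "cherry", "date", "elderberry")

# Base: value = TRUE returns the strings instead of their indices
err_base <- grep("err", fruit, value = TRUE)

# stringr: same result with string-first argument order
err_str <- str_subset(fruit, "err")

# Equivalent long form: detect, then subset
err_long <- fruit[str_detect(fruit, "err")]
```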
Extract by position: Return substring based on character positions
Base: Use substr() to extract substring by character positions
substr(x, start, stop); positive indices only
stringr: Use str_sub() to extract substring by character positions, supports negative indexing
str_sub(x, start, end); accepts negative indices
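A brief sketch of positional extraction on the example string, showing how negative indices in str_sub() count from the end.

```r
library(stringr)

favorite_class <- "d2m-R"

# Base: positions count from the start only
first_three <- substr(favorite_class, 1, 3)

# Base: the last character needs nchar() to find the end
last_base <- substr(favorite_class, nchar(favorite_class), nchar(favorite_class))

# stringr: negative indices count back from the end
last_str   <- str_sub(favorite_class, -1, -1)
last_three <- str_sub(favorite_class, -3, -1)
```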
Extract pattern from strings : Return matched pattern as string
Base: Extract pattern with gregexpr() (find positions) then regmatches() (extract matches by positions)
Extract regex pattern from strings : Return matched non-literal pattern as string
Base: Extract pattern with gregexpr() (find positions) then regmatches() (extract matches by positions)
[1] "The"
[[1]]
[1] "The" "QUICK" "brown" "fox" "jumps" "over" "the" "LAZY" "dog"
[[1]]
[1] "QUICK" "LAZY"
stringr: Extract pattern with str_extract() (first match) or str_extract_all() (all matches)
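The patterns behind the outputs above were lost, so the sketch below uses patterns that would plausibly produce them: a literal "The", \w+ for every word, and [A-Z]{2,} for runs of capitals. Treat these as assumptions rather than the original code.

```r
library(stringr)

sentence <- 'The QUICK brown fox jumps over the LAZY dog.'

# Base: find positions, then extract the matches at those positions
first_the <- regmatches(sentence, regexpr("The", sentence))
caps_base <- regmatches(sentence, gregexpr("[A-Z]{2,}", sentence))[[1]]

# stringr: first match vs. all matches
first_caps <- str_extract(sentence, "[A-Z]{2,}")
all_caps   <- str_extract_all(sentence, "[A-Z]{2,}")[[1]]
all_words  <- str_extract_all(sentence, "\\w+")[[1]]
```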
Replace patterns in strings : Return modified string with pattern replaced by new text
Base: sub() replaces first match, gsub() replaces all matches
stringr: Use str_replace() to replace first match, str_replace_all() to replace all matches
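A minimal sketch of first-match versus all-match replacement. The rating string and the replacement word "hot" are illustrative choices.

```r
library(stringr)

rating <- "cool cool cool"

# Base: pattern, replacement, string
first_base <- sub("cool", "hot", rating)
all_base   <- gsub("cool", "hot", rating)

# stringr: string, pattern, replacement
first_str <- str_replace(rating, "cool", "hot")
all_str   <- str_replace_all(rating, "cool", "hot")
```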
Remove patterns from strings : Return modified string with pattern removed
Base: Remove pattern with sub() or gsub() by replacing it with the empty string ""
stringr: Remove pattern with str_replace() or str_replace_all() and the empty string "" as replacement
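Removal is replacement with an empty string, which might look like the sketch below. The vowel pattern is an arbitrary example.

```r
library(stringr)

fruit <- c("apple", "banana", "cherry", "date", "elderberry")

# Base: remove the first vowel vs. all vowels
first_removed <- sub("[aeiou]", "", fruit)
all_removed   <- gsub("[aeiou]", "", fruit)

# stringr: same idea with string-first argument order
all_removed_str <- str_replace_all(fruit, "[aeiou]", "")
```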
Use regex patterns in replacement with both base and stringr functions to do more complex replacements. For example, you can use regex groups to reorder parts of a string:
Base
stringr
[1] "WORD WORD WORD WORD WORD WORD WORD WORD WORD."
[1] "Th*_Q**CK_br*wn_f*x_j*mps_*v*r_th*_L*ZY_d*g."
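The code for this slide did not survive, so the sketch below is a reconstruction under assumptions: the name-swapping example is invented to show group reordering, and the last two calls are guesses at what could have produced the two outputs shown above.

```r
library(stringr)

sentence <- 'The QUICK brown fox jumps over the LAZY dog.'

# Reorder with capture groups: \\1 and \\2 refer to the parenthesized parts
swapped_base <- sub("(\\w+), (\\w+)", "\\2 \\1", "Lovelace, Ada")
swapped_str  <- str_replace("Lovelace, Ada", "(\\w+), (\\w+)", "\\2 \\1")

# Replace every word with a fixed token
all_word <- gsub("\\w+", "WORD", sentence)

# stringr: a named vector applies several pattern = replacement pairs in turn
masked <- str_replace_all(sentence, c("[aeiouAEIOU]" = "*", " " = "_"))
```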
Clean up messy text : Return modified string with leading/trailing whitespace removed
Base: Remove leading/trailing whitespace with trimws()
stringr: Remove leading/trailing whitespace with str_trim()
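A short sketch of whitespace trimming on the example string, including one-sided trimming through the which and side arguments.

```r
library(stringr)

messy_string <- " Hello, World! "

# Trim both sides (the default in both functions)
clean_base <- trimws(messy_string)
clean_str  <- str_trim(messy_string)

# Trim only the leading whitespace
left_base <- trimws(messy_string, which = "left")
left_str  <- str_trim(messy_string, side = "left")
```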
Pad strings to fixed width : Return modified string with padding added to reach specified width
Base: Pad strings with sprintf() and width specifier
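One way padding with sprintf() might look is sketched below. str_pad() is not named on this slide; it is shown on the assumption that it is the stringr counterpart the deck had in mind.

```r
library(stringr)

# Base: a width specifier pads with spaces; "-" left-aligns
right_aligned <- sprintf("%10s", "apple")
left_aligned  <- sprintf("%-10s", "apple")

# Base: zero-pad an integer to three digits
zero_base <- sprintf("%03d", 7L)

# stringr: pad a string with a chosen character
zero_str <- str_pad("7", width = 3, side = "left", pad = "0")
```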
Change letter case : Return modified string with case converted
Base: Convert to all upper/lower case with toupper() & tolower()
stringr: Convert to all upper/lower case with str_to_upper() & str_to_lower()
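A quick sketch of case conversion on the example string; non-letter characters pass through unchanged.

```r
library(stringr)

favorite_class <- "d2m-R"

# Base
upper_base <- toupper(favorite_class)
lower_base <- tolower(favorite_class)

# stringr
upper_str <- str_to_upper(favorite_class)
lower_str <- str_to_lower(favorite_class)
```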
Literal characters: Match exactly as written
Metacharacters have special meaning, like defining what kinds of characters to recognize, how many times to match, or where in the string to look. Common metacharacters:
| Metacharacter | Meaning |
|---|---|
| `.` | Any single character except newline |
| `*` | 0 or more of the preceding element |
| `+` | 1 or more of the preceding element |
| `?` | 0 or 1 of the preceding element |
| `[]` | Character class (matches any one character inside) |
| `{}` | Quantifier (specifies number of occurrences) |
| `^` | Start of string |
| `$` | End of string |
| `\` | Escape character (treat next character literally) |
| `\|` | Alternation (matches the pattern before or after) |
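Escaping allows you to match metacharacters as literal characters by preceding them with the escape metacharacter \. Because R string literals also treat the backslash as an escape, the backslash itself must be doubled in R code:
- Match literal period: \. → "\\." in R
- Match literal asterisk: \* → "\\*" in R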
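The effect of escaping can be sketched as follows. The test strings are invented for illustration; fixed = TRUE and fixed() are shown as an alternative that treats the whole pattern literally.

```r
library(stringr)

# Unescaped "." matches any character, so this is TRUE even without a period
any_char <- grepl(".", "abc")

# Escaped "\\." matches only a literal period
no_period  <- grepl("\\.", "abc")
has_period <- grepl("\\.", "a.c")

# Alternative: treat the whole pattern as a literal string
fixed_base <- grepl(".", "abc", fixed = TRUE)
fixed_str  <- str_detect("a.c", fixed("."))

# Literal asterisk
has_star <- str_detect("2 * 3", "\\*")
```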
Character classes match any one character from a set. Common character classes:
| Class | Meaning |
|---|---|
| `[abc]` | a, b, or c |
| `[^abc]` | NOT a, b, or c |
| `[a-z]` | any lowercase letter |
| `\d` | digit |
| `\w` | word character (letter, digit, or underscore) |
| `\s` | whitespace character (space, tab, newline) |
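A sketch of character classes applied to the example vector. Note the doubled backslashes that R strings require; the "d2m_R rocks!" string is made up for this example.

```r
library(stringr)

favorite_class_units <- c("d", "2", "m", "-", "R")

# Which elements are digits, lowercase letters, or not letters at all?
is_digit   <- grepl("\\d", favorite_class_units)
is_lower   <- str_detect(favorite_class_units, "[a-z]")
not_letter <- str_detect(favorite_class_units, "[^A-Za-z]")

# \\w+ grabs runs of word characters (letters, digits, underscore)
words <- str_extract_all("d2m_R rocks!", "\\w+")[[1]]

# \\s matches whitespace
no_space <- gsub("\\s", "", " Hello, World! ")
```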
Quantifiers specify how many times the preceding element should be matched:
| Quantifier | Meaning |
|---|---|
| `*` | 0 or more of the preceding element |
| `+` | 1 or more of the preceding element |
| `?` | 0 or 1 of the preceding element |
| `{n}` | exactly n of the preceding element |
| `{n,}` | n or more of the preceding element |
| `{n,m}` | between n and m of the preceding element |
Use quantifiers in stringr functions to specify how many occurrences to replace, locate, count, etc.:
.* (greedy, matches as much as possible):
[1] "The quick brown fox jumps over the lazy dog. The dog"
.*? (lazy, matches as little as possible):
[1] "The quick brown fox jumps over the lazy dog"
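The input string for the greedy and lazy outputs above was not preserved. The sketch below assumes a two-sentence string and the patterns "The.*dog" and "The.*?dog", which would reproduce those outputs; it also shows a few counted quantifiers on invented strings.

```r
library(stringr)

# Hypothetical input, chosen so "dog" appears twice
text <- "The quick brown fox jumps over the lazy dog. The dog barked."

# Greedy: runs to the last "dog"; lazy: stops at the first "dog"
greedy <- str_extract(text, "The.*dog")
lazy   <- str_extract(text, "The.*?dog")

# Counted quantifiers
exactly_two  <- str_extract("aaaa", "a{2}")
two_to_three <- str_extract("aaaa", "a{2,3}")
optional_u   <- str_detect(c("color", "colour"), "colou?r")
```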
Anchors specify where in the string to look for a match:
| Anchor | Meaning |
|---|---|
| `^` | Start of string |
| `$` | End of string |
| `\b` | Word boundary (position between a word character and a non-word character) |
| `\B` | Non-word boundary (position between two word characters or two non-word characters) |

Note that the caret ^ has two uses: as an anchor for the start of the string, and as a negation symbol when used inside square brackets for character classes.
Use word boundaries for whole word matching:
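A small sketch of whole-word matching, using invented test strings. Without boundaries, "cat" also matches inside "concatenate".

```r
library(stringr)

phrases <- c("the cat sat", "concatenate")

# Plain pattern: matches "cat" anywhere, even inside a longer word
anywhere <- str_detect(phrases, "cat")

# With \\b on both sides: matches "cat" only as a whole word
whole_word <- str_detect(phrases, "\\bcat\\b")

# Base equivalent (perl = TRUE selects the PCRE engine)
whole_word_base <- grepl("\\bcat\\b", phrases, perl = TRUE)
```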
Validation pattern example (e.g., email, phone)
[1] TRUE
[1] FALSE
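The validation code itself is missing, so here is a deliberately simple sketch under stated assumptions: a hypothetical US-style phone format NNN-NNN-NNNN, anchored at both ends so that partial matches are rejected. Real-world phone or email validation usually needs a more permissive pattern.

```r
library(stringr)

# Hypothetical format: three digits, three digits, four digits, dash-separated
phone_pattern <- "^\\d{3}-\\d{3}-\\d{4}$"

valid_phone   <- str_detect("555-123-4567", phone_pattern)
invalid_phone <- str_detect("5551234567", phone_pattern)

# Base equivalent over a vector of candidates
checks <- grepl(phone_pattern, c("555-123-4567", "call 555-123-4567"))
```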
Capture groups extract specific parts of a matched pattern so you can back-reference them in the replacement or later in the regex.
(): Create capture group
\1, \2, etc.: Reference captured groups in replacement
Reorder US dates MM/DD/YYYY to YYYY-MM-DD using capture groups and back-references:
[1] "The date is 2025-12-31. Happy New Year!"
[1] "The date is 25-6-7. Happy birthday!"
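The inputs and pattern behind these two outputs were lost. Working backwards from the outputs, the sketch below assumes input sentences containing 12/31/2025 and 6/7/25 and a three-group pattern; note that it only reorders the parts and does not pad or expand them, which is why the second result is not a full YYYY-MM-DD date.

```r
library(stringr)

# Three capture groups: month, day, year (digit counts are flexible)
date_pattern <- "(\\d{1,2})/(\\d{1,2})/(\\d{2,4})"

# Assumed inputs, inferred from the outputs shown above
new_year <- "The date is 12/31/2025. Happy New Year!"
birthday <- "The date is 6/7/25. Happy birthday!"

# Back-references reorder the groups: year, then month, then day
new_year_out <- sub(date_pattern, "\\3-\\1-\\2", new_year)
birthday_out <- str_replace(birthday, date_pattern, "\\3-\\1-\\2")
```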
Create a function for the date reordering that accepts 1- or 2-digit months and days and 2- or 4-digit years as input but always delivers a consistent YYYY-MM-DD output format. Use capture groups to extract the components and back-references to reorder them in the desired format. Test your function on various date strings to ensure it works correctly.
#| label: regex-challenge
# Create function `reorder_date()`
reorder_date <- function(date_string) {
# Define regex pattern to capture month, day, and year with flexible digit counts
# Use str_replace to reorder the date components to YYYY-MM-DD
}
# Test with multiple string types
reorder_date("4/5/2023") # Expected: "2023-04-05"
Even challengier: Handle multiple date formats by adding arguments for the user to specify the input and/or output format.
D2M-R I | Week 6 & 7