a plain english guide to r and some packages

Author

hunter cheng

Published

February 4, 2026

basics

value types

name	abbr.	definition	examples	notes
logical	lgl	booleans, binary yes/no	`TRUE` `FALSE`
integer	int	whole numbers	`1L` `20L`	needs L after number to force integer type
double	dbl	numbers with a decimal part	`1.5` `20.0`	default numeric storage type; `1` typed without L is also stored as a double
complex	cplx	complex numbers like imaginaries	`1i` `2+3i`	ignore for now
character	chr	strings, located within quotation marks	`"a"` `"marky mark"`

functions for value types

typeof() / class()
- returns value type
- note: typeof() reports the storage type, so typeof(1) returns "double" where class(1) returns "numeric". both return "integer" for integers like 1L
as.logical(), as.character(), as.double(), as.numeric()
- forces the data you enter into a specific value type
- will return NA (sometimes with a warning) if the result would be nonsensical
  - e.g., as.logical("Mary") returns NA because "Mary" can’t be read as TRUE or FALSE
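a quick sketch of these functions in action. the values are made up for illustration, and the comments show what each line returns:

```r
# typeof() reports the storage type; class() is a little more general
typeof(1L)      # "integer"
typeof(1)       # "double"
class(1L)       # "integer"
class(1)        # "numeric"
typeof(2+3i)    # "complex"

# as.*() functions force a value into another type
as.integer("20")      # 20
as.character(1.5)     # "1.5"
as.logical("TRUE")    # TRUE

# nonsensical conversions come back as NA
as.logical("Mary")                     # NA
suppressWarnings(as.numeric("Mary"))   # NA (the hidden warning says NAs were introduced by coercion)
```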

types of data structures

structure	dimensions	mixed elements?	notes
vector	1d	no	all elements must be same type
matrix	2d	no	all elements must be same type
dataframe	2d	yes (column)	each column is a vector
list	flexible	yes	can contain all element types

vectorized functions vs. loops

to put it very simply, vectorized functions act once for an entire vector, whereas loops have to call functions for each element of a vector
this means vectorized functions are often faster and less demanding on your computer
why does this matter?
- to my best understanding, at our level? it doesn’t
  - unless you’re really invested in understanding the fine points of r or if you’re doing massive calculations that your computer can’t handle
- the only relevance for us is recognizing when certain functions will need a vector input, like ifelse() (vectorized) vs. if () ... else if () (non-vectorized)
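here is a small sketch of that difference, using a made-up scores vector. both approaches give the same answer; the vectorized one just gets there in a single call:

```r
scores <- c(55, 72, 90)

# vectorized: ifelse() checks every element in one call
vectorized_result <- ifelse(scores >= 60, "pass", "fail")
vectorized_result    # "fail" "pass" "pass"

# non-vectorized: if () only accepts a single TRUE/FALSE,
# so a loop has to feed it one element at a time
looped_result <- character(length(scores))
for (i in seq_along(scores)) {
  if (scores[i] >= 60) {
    looped_result[i] <- "pass"
  } else {
    looped_result[i] <- "fail"
  }
}
looped_result        # "fail" "pass" "pass"
```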

functions for data structures

length()
- takes a dataset, returns the number of elements in the data
  - for a dataframe, this is the number of columns (use nrow() for the number of rows)
class()
- takes a dataset and returns the type of data structure
- note: typeof() isn’t as helpful with data structures
as.array(), as.data.frame(), as.matrix(), as.list()
- forces a dataset into a specific data structure type
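a minimal sketch of these functions on small made-up objects. the object names (v, m, df, l) are just for illustration:

```r
v  <- c(25, 30)
m  <- matrix(25:30, nrow = 2)
df <- data.frame(name = c("A", "B"), age = c(25L, 30L))
l  <- list(v, df)

length(v)     # 2 (elements)
length(m)     # 6 (cells)
length(df)    # 2 (columns, not rows)
length(l)     # 2 (top-level elements)

class(m)      # "matrix" "array" (in r 4.0 and later)
class(df)     # "data.frame"
typeof(df)    # "list", which is why typeof() is less helpful here

v_as_list <- as.list(v)         # a list with two elements, 25 and 30
m_as_df   <- as.data.frame(m)   # a dataframe with three columns
```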

vector

Note

see the subsetting section for vector subsetting

one-dimensional
all elements are same type (i.e. all numeric/character/etc.)

vector_structure <- c(25, 30)
vector_structure

[1] 25 30

includes scalars, which are vectors with a length of one
- e.g., c(1)
in r, the shorter vector will be recycled (repeated) during operations with other vectors if lengths are not equivalent
- returns a warning if the length of the longer vector isn’t a multiple of the length of the shorter one

c(25, 30) + c(25, 30)                   # same length, no warning

[1] 50 60

c(25, 30) + c(25, 30, 35, 40)           # different lengths but factorable lengths, no warning

[1] 50 60 60 70

c(25, 30) + c(25, 30, 35)               # different lengths, NOT factorable lengths, warning

Warning in c(25, 30) + c(25, 30, 35): longer object length is not a multiple of
shorter object length

[1] 50 60 60

matrix

Note

see the subsetting section for matrix subsetting

two-dimensional; basically vector with another dimension added
all elements are same type
number of rows times number of cols should equal the data length
- nrow and ncol are optional; if you give only one, r works out the other from the data length

matrix_structure <- matrix(25:30, nrow = 2, ncol = 3)
matrix_structure

     [,1] [,2] [,3]
[1,]   25   27   29
[2,]   26   28   30

dataframe

Note

see the subsetting section for dataframe subsetting

two-dimensional (contains multiple vectors; one in each column)
columns can contain different element types
all columns must be same length
equivalent of tidyverse’s tibble
works easily with dplyr, ggplot2, and tidyr

dataframe_structure <- data.frame(name = c("A", "B"), age = c(25L, 30L), bday = c(1.1, 2.2), vector_structure)
dataframe_structure

  name age bday vector_structure
1    A  25  1.1               25
2    B  30  2.2               30

list

Note

see the subsetting section for list subsetting

flexible dimensionality
elements can be different types (including vectors and dataframes)
elements can be different lengths, but each vector inside the list still holds a single type

list_structure <- list(name = c("A", "B"), age = c(25L, 30L, 35L), bday = c(1.1, 2.2), age_mix = c(25L, 30.25), dataframe_structure, vector_structure)
list_structure

$name
[1] "A" "B"

$age
[1] 25 30 35

$bday
[1] 1.1 2.2

$age_mix
[1] 25.00 30.25

[[5]]
  name age bday vector_structure
1    A  25  1.1               25
2    B  30  2.2               30

[[6]]
[1] 25 30

subsetting

first, call the vector, then specify position or logical test in [single brackets]
will always return a vector

vector_structure[1]                           # 1st element

[1] 25

vector_structure[c(1, 2)]                     # 1st and 2nd element

[1] 25 30

vector_structure[-1]                          # all elements except the 1st

[1] 30

vector_structure[vector_structure < 30]       # all elements smaller than 30

[1] 25

call the matrix, then use [row, column] argumentation with single brackets
returns a vector when the result is a single row or column, otherwise a matrix

matrix_structure[1, 2]          # element at row 1, column 2

[1] 27

matrix_structure[2, ]           # entire row 2

[1] 26 28 30

matrix_structure[ , 2]          # entire column 2

[1] 27 28

matrix_structure[1:2, 3]        # rows 1-2 of column 3 (returned as a vector)

[1] 29 30

matrix_structure[ , c(1, 3)]    # columns 1 and 3

     [,1] [,2]
[1,]   25   29
[2,]   26   30

call the dataframe, then use [row, column], [single brackets], [[double brackets]], or $dollarsign

dataframe_structure[1, ]               # dataframe of 1st row with all columns

  name age bday vector_structure
1    A  25  1.1               25

dataframe_structure[ , 2]              # vector of 2nd column with all rows

[1] 25 30

dataframe_structure["name"]            # dataframe of "name" column

  name
1    A
2    B

dataframe_structure[ , "name"]         # vector of "name" column

[1] "A" "B"

dataframe_structure$name               # vector of "name" column

[1] "A" "B"

dataframe_structure$name[2]            # vector of value at 2nd row, "name" column

[1] "B"

dataframe_structure[["name"]]          # vector of "name" column

[1] "A" "B"

dataframe_structure[1, 2]              # vector of value at 1st row, 2nd column

[1] 25

dataframe_structure[1:2, "name"]       # vector of rows 1 - 2, "name" column

[1] "A" "B"

dataframe_structure[[1]]               # vector of 1st column

[1] "A" "B"

call the list, then use [single brackets] to get a sublist
- returns as list
to get actual elements, use [[double brackets]] or $dollarsign
- returns as vector

list_structure[1]                 # list of 1st element's contents

$name
[1] "A" "B"

list_structure[[1]]               # vector of 1st element's contents

[1] "A" "B"

list_structure[[1]][2]            # vector of 1st element's 2nd value

[1] "B"

list_structure["name"]            # list of element "name"'s contents

$name
[1] "A" "B"

list_structure$name               # vector of element "name"'s contents

[1] "A" "B"

list_structure$name[2]            # vector of element "name"'s 2nd value

[1] "B"

list_structure[["name"]]          # vector of element "name"'s contents

[1] "A" "B"

list_structure[["name"]][2]       # vector of element "name"'s 2nd value

[1] "B"

base r

operators

# arithmetic operators
x + y
x - y
x * y
x / y
x ^ y
x %% y                      # modulus; returns the remainder of x/y
x %/% y                     # integer division; returns the whole number result of x/y

# assignment operators 
x <- 2                      # used for assigning values to objects
mean(x = c(1, 2))           # used to specify what arguments should evaluate to within functions

# comparison operators
x == y                      # x equals y
x != y                      # x does NOT equal y
x > y                       # x is greater than y
x < y                       # x is less than y
x >= y                      # x is greater than or equal to y
x <= y                      # x is less than or equal to y

# logical operators (combine comparison statements)
logic_x & logic_y           # vectorized AND; TRUE where both logic_x AND logic_y are true
logic_x && logic_y          # non-vectorized AND; for single values only, mostly used inside if () conditions
!logic_x                    # NOT; flips TRUE to FALSE and vice versa. can be combined with other operators
logic_x | logic_y           # vectorized OR; TRUE where logic_x OR logic_y is true
logic_x || logic_y          # non-vectorized OR; for single values only, mostly used inside if () conditions

# misc operators
1:2                         # 1 through 2
element_x %in% vector_y     # element_x is in vector_y
matrix_x %*% matrix_y       # matrix multiplication
dependent_x ~ independent_y # separates dependent and independent variables in formulas that specify relationships between variables
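the listing above uses placeholder names, so here is a small runnable sketch with made-up values for a few of the less obvious operators:

```r
x <- 7
y <- 2

x %% y       # 1  (remainder of 7 / 2)
x %/% y      # 3  (whole-number part of 7 / 2)
x ^ y        # 49

a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)

a & b        # TRUE FALSE FALSE (compared element by element)
a | b        # TRUE TRUE TRUE
!a           # FALSE TRUE FALSE
a[1] && b[1] # TRUE (single values only)

3 %in% c(1, 2, 3)          # TRUE
c(1, 5) %in% c(1, 2, 3)    # TRUE FALSE
```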

exploring data

head()
- takes a dataset and returns the first six rows
- args:
  - head(dataset)
tail()
- takes a dataset and returns the last six rows of the dataset
- args:
  - tail(dataset)
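a quick sketch using mtcars, a dataset that ships with r. the n = argument isn't covered above; it is an optional argument that changes how many rows come back:

```r
first_six   <- head(mtcars)          # first six rows (the default)
first_three <- head(mtcars, n = 3)   # first three rows
last_two    <- tail(mtcars, n = 2)   # last two rows

first_three
```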

functions

`paste()`, `paste0()`

`apply()`

applies a function to the rows or columns of a matrix or data frame
args:
- apply(dataset, vector_of_operating_dimension, function)
  - for vector_of_operating_dimension, using 1 will apply the function over rows, 2 over columns, and c(1, 2) over rows and columns
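a minimal sketch of the three options on a small made-up matrix:

```r
m <- matrix(1:6, nrow = 2, ncol = 3)
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

row_sums <- apply(m, 1, sum)    # over rows: 9 12
col_sums <- apply(m, 2, sum)    # over columns: 3 7 11

# over rows and columns: the function runs on every single cell
times_ten <- apply(m, c(1, 2), function(cell) cell * 10)
times_ten                       # a 2 x 3 matrix with each value multiplied by 10
```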

`lapply()`, `sapply()`, `vapply()`

applies a function over a list or vector
lapply() returns a list the same length as the input dataset
sapply() returns a vector or matrix
- makes it more user-friendly
- is a wrapper around lapply() that simplifies the result when it can
vapply() is similar to sapply() but requires the FUN.VALUE argument, which specifies the type and length of each result
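here is a small sketch comparing the three on the same made-up list, so the difference in return formats is easier to see:

```r
values <- list(a = 1:3, b = 4:6)

as_list   <- lapply(values, mean)    # a list: $a is 2, $b is 5
as_vector <- sapply(values, mean)    # a named numeric vector: a = 2, b = 5

# vapply() needs to be told what each result looks like;
# numeric(1) means "one number per element", and anything else is an error
checked <- vapply(values, mean, FUN.VALUE = numeric(1))
checked                              # a = 2, b = 5
```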

control flow

conditionals

if ()
else
- else takes no condition of its own; it runs when the if () condition is FALSE
while ()
- you MUST add a line of code that ensures the while () condition eventually becomes FALSE
- if you don’t, you’re going to make an infinite loop that runs until you interrupt it or it crashes your r
- all for () loops can be written as while () loops, but not all while () loops can be written as for () loops
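a minimal sketch of both, with made-up values. note the line inside while () that changes the condition so the loop can end:

```r
temperature <- 15

if (temperature > 25) {
  label <- "hot"
} else if (temperature > 10) {
  label <- "mild"
} else {
  label <- "cold"
}
label        # "mild"

# while () keeps looping until its condition is FALSE
countdown <- 3
while (countdown > 0) {
  print(countdown)
  countdown <- countdown - 1   # without this line the loop never ends
}
countdown    # 0
```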

loops

there are two ways to run for () loops in r: over elements or over index positions
looping over elements looks like this:

for (element in vector_structure) {
  print(element)    # element is the literal value of each element in the vector
}

looping over indices looks like this:

for (i in seq_along(vector_structure)) {
  print(vector_structure[i])    # the value in position number i of vector_structure
  print(i)                      # i is the position, which also counts the loops so far
}

# note: the `i` here can be named anything you want
# using `i` for index positions is most common
# personally it reminds me i'm looking at positions, not elements

looping over indices may seem more complicated, but it’s for a good reason!
using index positions gives you the flexibility to access the actual value of the element at the same time as the position. this isn’t as easy if you try to do it the other way around (figure out index position given the value of the element)
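one common case where the index version pays off is writing results into a second vector. a small sketch with made-up heights:

```r
heights <- c(150, 160, 175)

# over elements: only the value is available inside the loop
for (height in heights) {
  print(height)
}

# over index positions: the position and the value are both available,
# so each result can be stored in the matching slot of another vector
doubled <- numeric(length(heights))
for (i in seq_along(heights)) {
  doubled[i] <- heights[i] * 2
}
doubled    # 300 320 350
```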

user-defined functions

stop()
return()
{{ }} only gets used to refer to an argument when incorporating tidy evaluations into a user-defined function
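a minimal sketch of stop() and return() in a user-defined function. safe_divide() is a hypothetical helper made up for this example, not a built-in:

```r
# hypothetical helper: divides two numbers, refusing to divide by zero
safe_divide <- function(numerator, denominator) {
  if (denominator == 0) {
    stop("denominator can't be zero")   # halts the function with an error
  }
  return(numerator / denominator)       # hands the result back to the caller
}

safe_divide(10, 2)    # 5

# try() catches the error so the script keeps running
failed <- try(safe_divide(10, 0), silent = TRUE)
class(failed)         # "try-error"
```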

tidyverse

dplyr (work in progress!)

Note

see here for an exhaustive list of functions in dplyr

column functions

`select()`

mostly used to select a set of columns, can be used to reorder columns or rename them
- note: takes column names as objects, not strings (i.e., column_1, not "column_1")
returns a tibble (dataframe) with the specified columns and drops all other columns
args:
- select(dataset, arguments_for_selecting_columns_here)
- for renaming columns, select(dataset, new_name = old_name)
- for reordering columns, select(dataset, column_3, column_1, column_2)
selection helpers
- starts_with("prefix"), ends_with("suffix"), contains("text"), matches("regex")
- where(fn)
  - returns variables where fn (e.g., is.numeric, written without parentheses) evaluates to TRUE
- everything()
  - can take the argument vars = c() where c() is a vector of variable names
  - if vars = is left empty, takes all variables from the current context/pipe
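a short sketch of select() and a few helpers, assuming dplyr is installed. the people tibble is made up for illustration:

```r
library(dplyr)

people <- tibble(
  name   = c("A", "B"),
  age    = c(25L, 30L),
  height = c(150.5, 160.2)
)

select(people, name, age)                          # keeps name and age, drops height
renamed   <- select(people, person = name)         # keeps only name, renamed to person
reordered <- select(people, height, everything())  # height first, then the rest
numeric_only <- select(people, where(is.numeric))  # age and height
select(people, starts_with("a"))                   # age
```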

`rename()`

renames some or all columns
returns a tibble with all of the columns in the original order
args:
- rename(dataset, new_name = old_name)

`relocate()`

relocates columns in dataframes using relative positions
returns a tibble with all of the columns
args:
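- relocate(dataset, column_3, .before = column_2) or relocate(dataset, column_3, .after = column_1)
  - supply only one of .before or .after; giving both returns an error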
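a small sketch of rename() and relocate() side by side, assuming dplyr is installed and using a made-up people tibble. each call uses either .before or .after, never both:

```r
library(dplyr)

people <- tibble(
  name   = c("A", "B"),
  age    = c(25L, 30L),
  height = c(150.5, 160.2)
)

renamed      <- rename(people, years = age)              # name, years, height
moved_before <- relocate(people, height, .before = name) # height, name, age
moved_after  <- relocate(people, name, .after = age)     # age, name, height
```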

`mutate()`

can be used to add new named columns
computes a value for every row using the existing columns
columns are created in the order given by the human (you!), so a later column can use one created earlier in the same call
args:
- mutate(dataset, new_column_1 = ..., new_column_2 = ...)
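a minimal sketch, assuming dplyr is installed. the column names are made up; the second new column relies on the first one, which works because they are created in order:

```r
library(dplyr)

people <- tibble(
  name      = c("A", "B"),
  height_cm = c(150, 160)
)

with_new_columns <- mutate(
  people,
  height_m    = height_cm / 100,   # created first
  height_m_sq = height_m ^ 2       # uses the column created on the line above
)
with_new_columns
```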

`glimpse()`

takes a dataframe or tibble
returns the number of rows, the number of columns, the names of columns, and the value types of each column
args:
- glimpse(dataset)

`across()`

row functions

`filter()`

returns a tibble (dataframe) with the chosen rows, drops all other rows
rows are chosen using a condition based on values in one or more columns
args:
- filter(dataset, condition)
  - e.g., filter(dataset, column_1 > 5)
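a short sketch of a few filter conditions, assuming dplyr is installed and using a made-up people tibble:

```r
library(dplyr)

people <- tibble(
  name = c("A", "B", "C"),
  age  = c(25L, 30L, 35L),
  city = c("x", "y", "x")
)

over_25     <- filter(people, age > 25)               # rows for B and C
over_25_x   <- filter(people, age > 25, city == "x")  # both conditions must be TRUE: C only
either_cond <- filter(people, age < 30 | city == "y") # either condition can be TRUE: A and B
```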

`arrange()`

arranges a dataframe, sorting rows by column values
if given 2 or more variables, sorts by the first one, then uses each following variable to break ties
- strings are sorted alphabetically
- default sort is ascending order
args:
- arrange(dataset, column_1, column_2)
- for descending order, arrange(dataset, desc(column_1))
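a minimal sketch, assuming dplyr is installed. the visits tibble is made up for illustration:

```r
library(dplyr)

visits <- tibble(
  id        = c("b", "a", "b", "a"),
  timepoint = c(2L, 1L, 1L, 2L)
)

by_id_then_time <- arrange(visits, id, timepoint)
by_id_then_time   # a 1, a 2, b 1, b 2

latest_first <- arrange(visits, desc(timepoint))
latest_first      # the timepoint 2 rows come first
```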

`distinct()`

keeps only the unique/distinct rows from a dataframe and removes the repeats
good for determining the different values present for a given variable
- e.g., you want to figure out the number of unique participants in a long dataset with multiple timepoints
args:
- distinct(dataset, column_1, .keep_all = FALSE)
  - if no columns are named, all columns are used to decide which rows are unique
  - if .keep_all = TRUE, all variables in the dataset are retained
  - if several rows share the same values in the chosen columns, the first of those rows is kept
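a small sketch of the participant example above, assuming dplyr is installed. the visits tibble is made up:

```r
library(dplyr)

visits <- tibble(
  id        = c("a", "a", "b", "b"),
  timepoint = c(1L, 2L, 1L, 2L),
  score     = c(10, 12, 9, 11)
)

unique_ids <- distinct(visits, id)   # one row per participant, only the id column
nrow(unique_ids)                     # 2 unique participants

first_rows <- distinct(visits, id, .keep_all = TRUE)
first_rows                           # first row for each id, with all columns kept
```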

`summarize()`

computes summary statistics across groups
returns a new data frame with one row for each combination of grouping variables
- if there are no grouping variables, returns one row summarizing all observations in the data frame
works well with mean(), sd(), min(), max(), etc.
args:
- summarize(dataset, col_name = operation())

`rowwise()`

grouping functions

`group_by()`

creates groups of rows based on unique values across one or more columns
good for using with functions like summarize() or count()
args:
- group_by(dataset, grouping_variable)

`count()`

takes a dataframe or a tibble
counts the number of observations in each group
args:
- count(dataset, grouping_variable)
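a minimal sketch tying group_by(), summarize(), and count() together, assuming dplyr is installed. the visits tibble is made up, and n() is dplyr's helper for counting rows in the current group:

```r
library(dplyr)

visits <- tibble(
  id    = c("a", "a", "b", "b"),
  score = c(10, 12, 9, 11)
)

# no groups: one row summarizing everything
overall <- summarize(visits, mean_score = mean(score))
overall       # mean_score is 10.5

# grouped: one row per id
grouped <- group_by(visits, id)
per_id  <- summarize(grouped, mean_score = mean(score), n_visits = n())
per_id        # a: 11 and 2, b: 10 and 2

# count() does the grouping and counting in one step
counts <- count(visits, id)
counts        # a: 2, b: 2
```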

vector functions

`case_when()`

`if_else()`

`recode_values()` and `replace_values()`

tidyr (work in progress!)

pivot_longer()

pivot_wider()

separate()

unite()

ggplot2

makes cool graphics for data analysis
view the official guide here
always starts using the format ggplot(dataset, mapping = aes())
is based on a layer logic: each geom is layered on top of the previous one
- layers can inherit arguments from each other
- arguments set for the whole ggplot (in the initial mapping = aes()) are inherited by the geoms set afterwards
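a small sketch of the layering and inheritance described above, assuming ggplot2 is installed and using the built-in mtcars dataset:

```r
library(ggplot2)

# x and y are set once for the whole plot...
layered_plot <- ggplot(mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point() +                # ...so this layer inherits them
  geom_smooth(method = "lm")    # ...and so does this one, drawn on top of the points

layered_plot
```

each + adds one layer, so the points are drawn first and the fitted line sits on top of them.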

geoms

Note

see here for more geom types

`geom_point()`

good for scatterplots
defaults to stat = "identity"

`geom_hline()`, `geom_vline()`, `geom_abline()`

create flat/straight lines (horizontal, vertical, and specified slope and intercept, respectively)
linetype can range from 0 to 6 (preset line types)
- see here for more documentation
linewidth is set in mm

`geom_smooth()`

creates a line that represents an approximating function
takes the argument method = which allows you to specify what type of function to use
- defaults to method = "loess" (local polynomial regression fitting) when there are fewer than 1,000 observations
- defaults to method = "gam" (generalized additive model) when there are 1,000 or more observations
also takes the argument formula = which allows you to specify the specific formula using x and y
- defaults to formula = NULL

`geom_bar()`

makes bar charts, but can also make stacked bar charts
- defaults to position = "stack"
defaults to stat = "count"
if x is categorical, you can go from a stacked bar chart to a grouped/clustered bar chart using position = "dodge"
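a minimal sketch of stacked vs. dodged bars, assuming ggplot2 is installed. it uses mtcars, with factor() turning the numeric cyl and gear columns into categories:

```r
library(ggplot2)

bar_base <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(gear)))

stacked_bars <- bar_base + geom_bar()                     # default: position = "stack"
dodged_bars  <- bar_base + geom_bar(position = "dodge")   # bars side by side instead

dodged_bars
```

no y = is given in either plot because stat = "count" works out the bar heights by counting rows.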

`geom_histogram()`

`geom_boxplot()`

`geom_area()`

creates an area that gets shaded in
good for area graphs

`geom_text()`, `geom_label()`

creates text within the plot
geom_text() only creates text
- takes the argument check_overlap = to control overlapping text
  - setting to TRUE will drop any text that would overlap text already drawn, FALSE (the default) will let it overlap
geom_label() will create text that has a rectangle behind it, making it easier to read
requires the argument label = to tell it what to say

aesthetics

geom aesthetics
- x = is needed for mapping any ggplot
  - y = is sometimes required; will default to stat = "identity" in specific geoms that require a y argument and stat = "count" when optional
    - "identity" means it will use y as the y-axis plotting variable
    - "count" means it will try to count the instances of x and use that as the y-axis plotting variable
general aesthetics can be used with scales as well as geoms
- fill is background/inside color
  - i like this site for looking up named colors in r
- color is outline
- alpha is transparency (1 = opaque, 0 = completely transparent)
- size is self explanatory
- shape is the shape of the point
  - mostly used with geom_point
  - can range from 0 through 25 (which are preset shapes), a character (which will become the shape of the point), or NA for nothing
    - only shapes 21 through 25 take fill argument
    - all other shapes only take color argument
  - see here for more documentation
a cool note: you can set aesthetics to conditionals using if_else()
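a short sketch of these aesthetics on a scatterplot, assuming ggplot2 and dplyr are installed. the mpg > 25 cutoff and the "high"/"low" labels are made up for illustration:

```r
library(ggplot2)

# fixed aesthetics: shape 21 is a circle that takes both fill and color
styled_points <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(shape = 21, fill = "skyblue", color = "black", alpha = 0.5, size = 3)

# conditional aesthetic: color depends on a test applied to each row
conditional_points <- ggplot(
  mtcars,
  aes(x = wt, y = mpg, color = dplyr::if_else(mpg > 25, "high", "low"))
) +
  geom_point()

conditional_points
```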

scales (or: adjusting how data is displayed)

labs()
- creates labels for your plot
- typical args are x =, y =, and title =
- other args:
  - subtitle = "Your text here" will create text below the title
  - caption = "Your text here" will create text displayed to the bottom-right of the plot
    - often used for sources, notes, or copyrights
  - tag = "Your text here" will create text displayed to the top-left of the plot
    - often used for labelling plots with letters
  - if you input an aesthetic that maps onto a specific variable, you can set the aesthetic as an argument to title the scale
    - e.g., if each color represents a specific group, you could do color = "Groups" to title the color legend “Groups” instead of the name of the raw variable
- if a plot has a label you want to get rid of, you can set the argument to = NULL and get rid of it completely
  - if you want the space to be still allocated, you can set it to an empty string (= "")
lims()
- controls the limits of the axes for the plot
- takes the arguments (x = c(lower_x_limit, upper_x_limit), y = c(lower_y_limit, upper_y_limit))
- can also take aesthetics as arguments
scales
- scale_x_discrete()/scale_y_discrete()
  - args:
    - labels = c("Label 1", "Label 2") controls the text of the labels for the axis ticks
- scale_x_continuous()/scale_y_continuous()
  - args:
    - n.breaks = controls the number of major ticks along the axis
    - breaks = c() controls the exact ticks that will show up
    - labels = controls the format of the tick labels
      - example arguments:
        
        scales::percent
        
        scales::label_dollar()
        
        see here for more
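a minimal sketch combining labs(), lims(), and a continuous scale, assuming ggplot2 (and the scales package it depends on) is installed. the label text and the " mpg" suffix are made up for illustration:

```r
library(ggplot2)

labelled_plot <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  labs(
    title   = "fuel economy by weight",
    x       = "weight (1000 lbs)",
    y       = "miles per gallon",
    color   = "cylinders",             # titles the color legend
    caption = "source: mtcars"
  ) +
  lims(x = c(1, 6)) +                  # x-axis runs from 1 to 6
  scale_y_continuous(
    n.breaks = 8,                                  # ask for roughly 8 major ticks
    labels   = scales::label_number(suffix = " mpg")  # formats each tick label
  )

labelled_plot
```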

guides (axes and legends)

must be wrapped in a bigger guides() layer
guide_legend()
guide_axis()
- takes the argument angle =, lets you angle the text
- can also be set in the respective scale_* layer as guide = guide_axis()
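a small sketch of both ways to set a guide, assuming ggplot2 is installed. the 45 degree angle and the "gears" legend title are made up for illustration:

```r
library(ggplot2)

bar_plot <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(gear))) +
  geom_bar()

# option 1: wrap the guide functions in guides(), one argument per aesthetic
with_guides <- bar_plot +
  guides(
    x    = guide_axis(angle = 45),          # angles the x-axis text
    fill = guide_legend(title = "gears")    # titles the fill legend
  )

# option 2: set the axis guide inside the matching scale_* layer
with_scale <- bar_plot +
  scale_x_discrete(guide = guide_axis(angle = 45))

with_guides
```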

stringr (work in progress!)

str_view()

str_detect()

str_count()

str_extract(), str_extract_all()

str_replace(), str_replace_all()

str_split()

regular expressions (regex)

purrr (work in progress!)

map()

forcats (work in progress!)

fct_relevel()

fct_infreq()

fct_reorder()

lubridate (work in progress!)

make_datetime()

ymd(), mdy(), dmy()

year(), month(), day(), wday()
