
visualization basics, ggplot2, grammar of graphics
2026-02-17
What is data visualization?
Clarity: Communicate all, and only, the information necessary to tell the story
Simplicity: Minimize distraction & send one message at a time
Accuracy: Use reliable data (garbage in, garbage out) and be faithful to it
Consistency: Represent similar ideas in similar ways & meet audience expectations
Relevance: Know your audience & speak to them
# First create the scatter plot
plot(iris$Sepal.Length, iris$Sepal.Width,
     main = "base R: Iris Sepal Length vs Width",
     xlab = "Sepal Length",
     ylab = "Sepal Width",
     col = as.numeric(iris$Species),
     pch = 19)
# Add regression lines for each species
species_levels <- levels(iris$Species)
colors <- seq_along(species_levels)
for (i in seq_along(species_levels)) {
  subset_data <- iris[iris$Species == species_levels[i], ]
  reg <- lm(Sepal.Width ~ Sepal.Length,
            data = subset_data)
  abline(reg, col = colors[i], lwd = 2)
}
# Add legend
legend("topright",
       legend = species_levels,
       col = colors,
       pch = 19)
Other plotting systems: lattice, ggplot2
The ggplot2 package: a grammar of graphics, analogous to dplyr’s “grammar for data manipulation”
What’s the difference between ggplot and ggplot2?
ggplot() is a function, ggplot2 is a package.
So ggplot2 is version 2?
No.
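A quick way to see the distinction, as a minimal sketch that assumes the ggplot2 package is installed: the package is loaded by its name, and ggplot() is one of the functions it exports. The variable names below are illustrative only.

```r
# Load the package (its name is ggplot2)
library(ggplot2)

# ggplot() is a function exported by that package
is_function <- is.function(ggplot)
in_package  <- "ggplot" %in% getNamespaceExports("ggplot2")

# The same function can be called with an explicit package prefix
p <- ggplot2::ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width))
```

There is no separate ggplot package to load for this; library(ggplot2) is all that is needed.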
library(ggplot2)

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width,
                 color = Species)) +
  # Add points with custom appearance
  geom_point(size = 3, alpha = 0.7) +
  # Add regression lines with custom appearance
  geom_smooth(method = "lm", se = TRUE, alpha = 0.2,
              linewidth = 1.2, linetype = "dashed") +
  # Customize colors using a custom palette
  scale_color_manual(values =
                       c("#FF6B6B", "#4ECDC4", "#45B7D1")) +
  # Add labels with custom formatting
  labs(title = "Sepal Dimensions Across Iris Species",
       subtitle = "Comparing Length vs Width with Trend Lines",
       x = "Sepal Length (cm)",
       y = "Sepal Width (cm)",
       caption = "Data: Edgar Anderson's Iris Dataset") +
  # Customize theme elements:
  # title, axis, legend, panel, borders
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold",
                              margin = margin(b = 20)),
    plot.subtitle = element_text(size = 12, color = "grey40"),
    axis.title = element_text(size = 10, face = "bold"),
    axis.text = element_text(size = 9),
    legend.position = "bottom",
    legend.title = element_text(face = "bold"),
    legend.background = element_rect(
      fill = "white", color = "grey90"),
    panel.grid.major = element_line(color = "grey90"),
    panel.grid.minor = element_blank(),
    plot.background = element_rect(fill = "white", color = NA),
    panel.border = element_rect(color = "grey90", fill = NA)
  ) +
  # Set specific axis limits
  coord_cartesian(
    xlim = c(min(iris$Sepal.Length) - 0.2,
             max(iris$Sepal.Length) + 0.2),
    ylim = c(min(iris$Sepal.Width) - 0.2,
             max(iris$Sepal.Width) + 0.2)
  )
library(tidyr)    # pivot_longer(); also re-exports the %>% pipe
library(ggplot2)

pivot_longer(iris,
             cols = c(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width),
             names_to = "Measurement",
             values_to = "Value") %>%
  ggplot(aes(x = Value, fill = Species)) +
  geom_histogram(alpha = 0.8, bins = 15,       # histogram
                 position = "identity") +
  facet_wrap(~Measurement, scales = "free") +  # subplots
  theme_minimal() +
  labs(title = "Distributions of Iris Measurements",
       x = "Measurement Value (cm)")
Data: ggplot(data = df)
Aesthetics: aes()
Geometries: geom_*()
Statistics: stat_*() or geom_*(stat = "...")
Scales: scale_[aes-type]_[data-type]()
Coordinates: coord_*()
Facets: facet_wrap() or facet_grid()
Theme: theme()
ggplot(data = iris,                      # Data layer
       aes(x = Sepal.Length,             # Aesthetics
           y = Sepal.Width,
           color = Species)) +
  geom_point() +                         # Geometries
  stat_smooth(method = "lm") +           # Statistics
  scale_color_viridis_d() +              # Scale
  coord_cartesian(xlim = c(4, 8)) +      # Coordinates
  facet_wrap(~Species) +                 # Facets
  theme_bw() +                           # Theme
  labs(title = "Iris Plot")
ggplot(data = iris,                      # Data layer
       aes(x = Sepal.Length,             # Aesthetics
           y = Sepal.Width,
           color = Species)) +
  geom_point() +                         # Geometries
  #stat_smooth(method = "lm") +          # Statistics
  scale_color_viridis_d() +              # Scale
  coord_cartesian(xlim = c(4, 8)) +      # Coordinates
  #facet_wrap(~Species) +                # Facets
  theme_bw() +                           # Theme
  labs(title = "Iris Plot")
ggplot(data = iris,                      # Data layer
       aes(x = Sepal.Length,             # Aesthetics
           y = Sepal.Width,
           color = Species)) +
  geom_point() +                         # Geometries
  stat_smooth(method = "lm") +           # Statistics
  #scale_color_viridis_d() +             # Scale
  #coord_cartesian(xlim = c(4, 8)) +     # Coordinates
  facet_wrap(~Species) +                 # Facets
  #theme_bw()                            # Theme
  labs(title = "Iris Plot")
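Because layers are combined with +, the variations above can also be produced by saving a base plot as an object and adding layers to it, rather than commenting lines in and out. This is a minimal sketch assuming ggplot2 is installed; the object names base_plot, points_only, and with_trend are illustrative and not from the slides.

```r
library(ggplot2)

# Data and aesthetics only: renders an empty panel with axes
base_plot <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width,
                              color = Species))

# Add one geometry layer
points_only <- base_plot + geom_point()

# Add a statistics layer and facets on top of the points
with_trend <- points_only +
  stat_smooth(method = "lm", formula = y ~ x) +
  facet_wrap(~Species)
```

Printing any of these objects (for example, typing with_trend at the console) renders that version of the plot.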
Grammar of data manipulation: dplyr pipelines, chained with |> or %>%
Grammar of graphics: ggplot2 layers, chained with +
Essential
Common
Specialized
Aesthetics (aes)
Which aesthetics make sense depends on the geometry:
linetype aes with a line graph, but not a scatterplot
fill aes for filled shapes like bar graphs, but not line graphs
Which aesthetics make sense depends on the data type:
shape with categorical/discrete data, because there is a fixed set of shapes to use
size with continuous data, because size itself is continuous and unbounded
color for continuous or categorical data, because colors can be assigned discretely or generated across a spectrum
aesthetics vs “aesthetic”
fill, color, size, etc. can be aesthetics (mapped to data) or simply “aesthetic” (fixed styling)
Aesthetics are data-dependent visual properties, mapped inside the aes() function
Fixed properties use the same elements without data-dependence, set outside aes()
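The difference between a mapped aesthetic and a fixed property can be sketched as follows, assuming ggplot2 is installed. The object names are illustrative; layer_data() is used here only to inspect the colors ggplot2 computed for each point.

```r
library(ggplot2)

# Mapped: color depends on the data, so it goes inside aes()
mapped <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(color = Species))

# Fixed: every point gets the same color, so it goes outside aes()
fixed <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(color = "steelblue")

# Inspect the colors computed for the point layer of each plot
mapped_colors <- unique(layer_data(mapped)$colour)  # one per species
fixed_colors  <- unique(layer_data(fixed)$colour)   # a single value
```

A common slip is writing aes(color = "steelblue"), which treats the string as a one-level category to be mapped rather than as the color to draw with.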
aes reference (1/2)

| Aesthetic | Description | Mapped data | Unmapped specs |
|---|---|---|---|
| x and y (position) | x and y coordinates of the plot. Nearly every plot requires at least one of these “position” aesthetics. | continuous, categorical | n/a |
| group | How observations are grouped together (if not defined with another grouping aes) | categorical | n/a |
| color | Color of points, lines, text, and other 1D shapes. For filled shapes, this will be the outline color. | continuous, categorical | string (color name or hex code) |
| fill | Fill color of 2D shapes like bars, polygons, etc. | continuous, categorical | string (color name or hex code) |
| alpha | Opacity/transparency of any element | continuous | number between 0 and 1 |

aes reference (2/2)

| Aesthetic | Description | Mapped data | Unmapped specs |
|---|---|---|---|
| size | Size of points and text (line width uses linewidth as of ggplot2 3.4.0) | continuous | numeric values |
| shape | Shape of points, out of 26 options. Default is a solid circle (19) | categorical | integers 0-25 or shape name (string) |
| linetype | Type of line, out of 6 options or blank (0). Default is solid line (1). | categorical | integers 0-6 or line name (string) |
| linewidth | Width of lines | continuous | numeric values |
| stroke | Width of shape outlines | continuous | numeric values |
| label | Text content | continuous, categorical | strings |
| fontface | Font style for text | categorical | String: “plain”, “bold”, “italic”, or “bold.italic” |
| family | Font family for text | categorical | String with font name |
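To illustrate the "Mapped data" and "Unmapped specs" columns, here is a minimal sketch assuming ggplot2 is installed. The object names are illustrative, and the hex code is simply reused from the palette earlier in the deck.

```r
library(ggplot2)

base <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width))

# Categorical variable mapped to color: one discrete color per level
by_species <- base + geom_point(aes(color = Species))

# Continuous variable mapped to color: colors come from a gradient
by_petal <- base + geom_point(aes(color = Petal.Length))

# Unmapped specs: fixed values passed outside aes()
fixed_specs <- base +
  geom_point(shape = 17, size = 3, alpha = 0.5, color = "#4ECDC4")

# Count the distinct colors ggplot2 computed in each case
n_discrete   <- length(unique(layer_data(by_species)$colour))
n_continuous <- length(unique(layer_data(by_petal)$colour))
```

The categorical mapping yields exactly as many colors as there are species, while the continuous mapping yields many shades along a gradient.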




What goes in the aes() function depends on the geom:
Scatter plots: geom_point() takes x and y, plus color, shape, size, etc.
Bar charts: geom_bar() takes x and, optionally, fill; no y mapping is needed because bar heights are counts

| Geom Layer | Variables* | Description | Data Types |
|---|---|---|---|
| geom_histogram() | 1 | Creates bins and counts observations within each bin | Continuous x |
| geom_density() | 1 | Creates a smoothed density estimate | Continuous x |
| geom_boxplot() | 1-2 | Shows distribution summary with quartiles and outliers | Continuous y, Optional categorical x |
| geom_violin() | 1-2 | Shows density estimate symmetrically | Continuous y, Categorical x |
| geom_bar() | 1-2 | Creates bars with heights proportional to number of cases | Categorical x |
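A few of the distribution geoms listed above, as a minimal sketch assuming ggplot2 is installed. The object names are illustrative, and bins = 15 is an arbitrary choice.

```r
library(ggplot2)

# One continuous variable, binned into counts
hist_plot <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(bins = 15)

# The same variable as a smoothed density estimate
dens_plot <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_density()

# A continuous y summarized within each level of a categorical x
box_plot <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot()

# The bin counts computed by the histogram layer
hist_counts <- layer_data(hist_plot)$count
```

The histogram and density plots need only x; the boxplot adds a categorical x to split the continuous y into groups.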

| Geom Layer | Variables* | Description | Data Types |
|---|---|---|---|
| geom_point() | 2 | Creates a scatter plot | Continuous x & y |
| geom_line() | 2 | Connects observations in order | Continuous x & y |
| geom_smooth() | 2 | Adds a smoothed conditional mean | Continuous x & y |
| geom_area() | 2 | Creates a line plot filled to the x-axis | Continuous x & y |
| geom_tile() | 2-3 | Creates rectangles based on x and y positions | Any x & y, Optional fill |
ggplot’s flexibility is a double-edged sword!
geom_histogram(): binned distribution
geom_boxplot(): distribution with quartiles and outliers
geom_bar(): count of observations in each category
geom_col(): height of bars based on pre-summarized data
library(dplyr)    # group_by(), summarise()
library(tidyr)    # pivot_longer()
library(ggplot2)

iris |>
  pivot_longer(
    cols = starts_with("Sepal"),
    names_to = "Measurement",
    values_to = "Value"
  ) |>
  group_by(Species, Measurement) |>
  summarise(Value = mean(Value),
            .groups = "drop") |>
  ggplot(aes(x = Species, y = Value,
             fill = Measurement)) +
  geom_col(position = "dodge") +
  labs(
    title = "Mean sepal measurements by species",
    x = "Species",
    y = "Mean (cm)"
  )
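For contrast with the geom_col() pipeline above, geom_bar() does the counting itself, so only x is mapped. This is a minimal sketch assuming ggplot2 is installed; count_plot and bar_heights are illustrative names.

```r
library(ggplot2)

# geom_bar() counts the rows in each category (its default stat is "count")
count_plot <- ggplot(iris, aes(x = Species)) +
  geom_bar()

# The bar heights computed by the layer: one count per species
bar_heights <- layer_data(count_plot)$count
```

Since iris has 50 rows per species, all three bars have the same height here. With pre-summarized data, geom_col() is the usual choice; geom_bar(stat = "identity") behaves the same way.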
geom_point(): scatter plot
geom_jitter(): scatter plot with random noise
geom_smooth(): smoothed mean (e.g., regression line)
geom_line(): connects observations in order of the x variable
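Jitter is most useful when points overlap exactly, which happens in iris because measurements are recorded to 0.1 cm. The sketch below assumes ggplot2 is installed; it uses geom_point() with position_jitter() so that a seed can be set for reproducibility, and geom_jitter() is shorthand for the same idea. The offsets of 0.03 are an arbitrary illustrative choice.

```r
library(ggplot2)

base <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width))

# Plain scatter plot: identical measurements are drawn on top of each other
exact <- base + geom_point(alpha = 0.5)

# Small random offsets separate overlapping points; the seed makes them repeatable
jittered <- base +
  geom_point(alpha = 0.5,
             position = position_jitter(width = 0.03, height = 0.03, seed = 1))

# Largest horizontal shift applied to any point
max_shift <- max(abs(layer_data(jittered)$x - iris$Sepal.Length))
```

Keeping the offsets well below the measurement resolution separates the points without misrepresenting the values.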
D2M-R I