From Data to Manuscript in R
Preface
This is an in-development textbook for the sequence “Data to Manuscript in R” offered at the University of Chicago. What you’re seeing is in very, very early stages, with lots of blank spots, placeholders, unnecessary sections, and random notes and ramblings instead of clear, concise instruction. Much of the content also duplicates other, more reliable course resources, like the course slides and website. If you keep exploring even knowing what a mess it is, consider yourself warned!
This book is “published” as a resource for current students in the course as an example of how to use RStudio and Quarto to create and share documents. Students can also contribute to the book for course credit.
Outside of what I assign as readings for D2MR, this book probably isn’t the best resource for learning R, especially if you aren’t currently a student in the class who will get all my caveats and asterisks.
If you’re looking for stable and complete resources to learn R, I recommend:
- R for Data Science
- Happy Git and GitHub for the useR
- STAT545
- Hands-On Programming with R
- Data Science in a Box
- (Re)Master the Tidyverse Workshops
- RYouWithMe
Outline
Based on current slides, but definitely not ideal for the book…
One of the main goals for the book is to be a supplement to class slides. The slides are currently way too verbose, so the idea is to migrate the details and nitty-gritty stuff away from the slides and into this book, which can be read at the student’s own pace and be even more verbose when needed. Right now, there’s a lot of overlap and a lot of seemingly random cut-and-paste.
If you’re a student looking to contribute to the book for a core skills project, one way to jump in is to look through the slides and see where things could be migrated to the book to tighten up the slides and expand the book.
Quarter 1
- Introduction
- PSYC 30550 D2MR - Overview
- The class itself - link to course page
- Assessment structure
- D2MR Workflow
- This book
- The goals
- How to use it
- How to contribute
- Getting Started/Setting up (in the form of walking through a dummy repo)
- Download and install R and RStudio
- Get familiar with RStudio
- R Notebooks
- Preview Quarto
- Get Git
- Connect RStudio to GitHub
- (Optional) Set up Copilot in RStudio
- Solving Problems
- Introduction - why this is a whole section
- Using AI
- Best practices - keep your code:
- Standardized
- Intelligible
- Maintainable
- Contextualized
- Debugging & Troubleshooting
- Documentation
- Help/lookup functions
- Function documentation
- Package documentation
- Resources
- Solving your own problems
- Error messages
- Garbage in, garbage out (and other tropes)
- Line-by-line debugging
- Rubber ducking
- Ask the internet
- Start strong - asking good questions
- Crowdsource
- Use AI constructively
- Ask humans – for D2MR Students
- Ask your classmates
- Use Slack
- Ask your instructor & TAs
- The D2MR troubleshooting process
- Starting from “Nothing”
- RStudio Essentials
- RStudio
- Overview + what’s an IDE
- Layout
- Source pane
- Console pane
- The rest of the panes
- Environment
- History
- Files
- Plots
- Connections
- Packages
- Help
- Build
- VCS
- Tutorial
- Viewer
- Presentations
- Keyboard shortcuts
- Customizing RStudio
- File types - just a preview, we’ll cover these in more detail later!
- .Rmd
- .qmd
- .R
- .Rproj
- Lots of metadata files you can mostly ignore (including literally in your .gitignore)
- .Rhistory
- .RData
- .Rprofile
- .Rproj.user
- Git & GitHub Essentials
- Introduction – the point of version control
- Git
- Overview & workflow musts
- Git lingo
- Repositories
- repository / repo
- initialize
- clone
- branch & checkout
- Version control
- commit
- stage
- fetch
- pull
- push
- Merging
- diff
- merge
- merge conflict
- rebase
- fast-forward (ff)
- squash
- cherry-pick
- stash
- Remote repositories
- remote
- origin
- upstream
- fork
- pull request
- (Optional) Using Git in the terminal
- Repositories
- Repo structure
- What does and does not go in a repo?
- Top-level essentials
- README
- .gitignore
- The rest of your stuff
- Metadata and information
- GitHub
- What is GitHub? How is it different from Git?
- GitHub features
- Issues
- Pull requests
- Pages
- Copilot
- More
- Actions
- Projects
- Codespaces
- Interfacing with GitHub (without RStudio)
- GitHub website
- GitHub Desktop
- Other options
- VS Code and other IDEs
- GitHub CLI
- Interfacing with GitHub in RStudio
- Connecting RStudio to GitHub
- Using GitHub in RStudio
- Cloning a repo
- Committing changes
- Pushing changes
- Pulling changes
- Creating pull requests
- Git Pains
- Common issues
- Helpful resources
- The nuclear option
- R Language Essentials
- Introduction
- What is R and why use it?
- Object-oriented programming
- R syntax
- Variables
- Functions
- Data types
- Operators
- R data structures
- Vectors
- Lists
- Matrices
- Data frames
- Tibbles
- R packages
- What are packages?
- Installing and loading packages
- Same function, different packages, oh no!
- Functions with different names in different packages that all do the same thing
- Functions with the same name in different packages that do different things
- Commonly used packages in D2MR
- R Programming
- Essential concepts in base R
- Object assignment
- Creating dummy variables and dataframes
- Indexing and subsetting with [] & $
- (more base R essentials)
- Iteration
- Conditional statements
- if else
- case_when
- Loops
- for loops
- while loops
- Functions in R
- Writing functions
- Function arguments and return values
- Scope and environments
- Regular expressions
- What is regex? What’s the point?
- Basic syntax
- Common use cases
- Welcome to the Tidyverse!
- Introduction to the Tidyverse & tidy data
- Overview of the tidy ecosystem
- Tidy data principles
- Core packages in the Tidyverse and general functions
- Terminology
- Importing and exporting with readr
- Overview
- Tabular data - what counts?
- File types
- R objects, including tibbles
- Reading, writing, rereading (intermediate datasets)
- Reading data with read_* functions
- Writing data with write_* functions
- Other packages
- readxl for Excel files
- haven for SPSS, SAS, and Stata files
- googlesheets4 for Google Sheets
- jsonlite for JSON files
- DBI and dbplyr for databases
- Handling common import/export issues
- Data manipulation with dplyr
- Introduction to dplyr
- The point of pipelines (highly readable, but verbose)
- The pipe operator (%>%) and magrittr
- Chaining operations with pipes
- Selecting data
- select, rename
- filter
- arrange
- Manipulating data
- mutate
- summarize
- group_by
- distinct
- count
- dplyr practice
- Data tidying with tidyr
- What counts as “tidying” data?
- Remember what “tidy” means in the tidyverse
- Tidying is reshaping and systematically cleaning data
- Reshape data
- pivot_longer
- pivot_wider
- cast, melt, gather, spread, etc.
- Combine and split cells
- unite
- separate functions
- Expand tables
- expand
- complete
- Handle missing values
- drop_na
- fill
- replace_na
- Advanced: Nested data
- What is nested data and why might you use it?
- We’re not going to cover this in D2MR, but tidyr has functions for creating, reshaping, and transforming nested data
- Working with different data types in the tidyverse
- Why specialized packages exist for text, factors, and dates
- Base R can do this stuff, but it’s not always great
- Tidyverse functions are useful:
- Consistent syntax
- More intuitive functions
- More intuitive pattern matching (stringr)
- Seamlessly compatible with dplyr, tidyr, and the rest of the tidyverse
- Text data with stringr
- Overview
- Remember what strings and character vectors are?
- Remember what regex is?
- Special things about strings (using quotes, escaping characters, etc.)
- Base R can do string stuff - Useful base R string functions
- paste & paste0
- c
- toupper, tolower
- is.character
- toString
- Pattern matching with grep family
- Match strings
- str_detect
- str_starts
- str_ends
- str_count
- Subsetting and length
- str_sub
- str_subset
- str_length
- str_pad
- str_trunc
- str_trim
- Mutate, join, split
- str_sub (again)
- str_replace, str_replace_all
- str_remove, str_remove_all
- str_to_lower, str_to_upper, str_to_title
- str_split
- str_c, str_glue
- str_flatten
- Factors with forcats
- Overview
- What are factors?
- They look like strings, how are they different? Why do we have to have a whole separate package for them?
- Levels
- Order
- Closed set
- Base R can do factor stuff (and unlike with strings, you’ll use these base R factor functions a lot!)
- factor
- levels
- relevel
- other base factor functions are mostly variations on these
- Useful forcats functions
- fct_relevel
- fct_reorder
- fct_rev
- fct_recode
- fct_collapse
- fct_other
- fct_drop
- fct_expand
- Dates and times with lubridate
- Overview
- What are dates and times?
- Why do we need a whole package for them?
- Base R can do date and time stuff, but it’s not great
- as.Date
- as.POSIXct
- as.POSIXlt
- strptime
- lubridate functions
- Parsing dates and times
- ymd, mdy, dmy, ymd_hms, etc.
- parse_date_time
- Extracting components
- year, month, day, hour, minute, second
- wday, yday, mday
- Manipulating dates and times
- %m+% and %m-% for adding and subtracting time
- interval and duration
- floor_date, ceiling_date, round_date
- Formatting dates and times
- format
- strftime
- Basic visualization and summary statistics
- This chapter is not the ggplot2 chapter or a deep stats chapter. It’s a connection between the two quarters of the class.
- Putting things together
- How data manipulation in the tidyverse leads to easier visualization
- How visualization leads to better understanding of your data
- How visualization and summary statistics fit together
- Distributions of continuous variables
- Histograms
- Box plots
- Summary statistics: mean, median, variance, standard deviation
- Comparing groups
- Bar plots
- Summary statistics: grouped means, medians, etc.
- t-tests
- More advanced for later: ANOVA, chi-squared tests
- Relationships between continuous variables
- Scatter plots and smoothing lines
- Correlation
- Basic linear regression
- More advanced for later: Multiple regression, logistic regression
- Midpoint review
Quarter 2
Appendix
- Glossary
- References