Finding Data
Why You Need a Dataset
The integrative data project requires you to demonstrate the full research workflow: data preparation, analysis, and manuscript reporting. You’ll build a reproducible pipeline from raw data to publication-ready output.
This differs from core standards projects, which focus on specific technical skills. The integrative project pulls together everything you learn across the quarter.
Your dataset should be complete enough to support a full analysis pipeline—from data cleaning and transformation through statistical modeling to final visualization. Even in D2M-R I, where we don’t cover advanced analysis, this should still be the goal for your data. This means having sufficient observations, relevant variables for your research questions, and enough complexity to demonstrate meaningful data wrangling skills without being so messy that cleaning becomes the entire project.
Using Your Own Data
“Your own data” means data you have a personal stake in analyzing as part of research you’re conducting or contributing to. This includes:
- Thesis or dissertation data (BA, MA, or PhD)
- Data from a lab you work in or research team you’re part of
- Data you’re collecting for a faculty member’s project
- Pilot data from studies you’re designing
- Data from independent research projects
It does not necessarily have to be data you collected yourself, and it can even be data collected for an entirely different purpose. “Your” data just means you’re bringing it to the table for your own work, rather than hunting for something out there to make work for this class specifically.
Quarter 1: Using your own data is strongly preferred. Students working with thesis data (BA or MA), lab projects, or ongoing research get the most from this course. You’ll build directly applicable skills and advance your actual research.
Quarter 2: You are required to use your own data for the integrative project. You must have a dataset with personal investment and clear research questions before the start of the quarter.
If you lack your own data, publicly available datasets work for Quarter 1. However, start identifying your own dataset early—you’ll need it by Quarter 2.
Finding Public Data
Start with your research question. Identify the domain (psychology, public health, social science) and core variables you need. This focus prevents getting lost in vast repositories.
Use multiple search approaches:
- Domain-specific repositories (ICPSR, Roper, Journal of Open Psychology Data) offer curated, documented datasets
- Meta-search engines (Google Dataset Search, re3data.org) cast a wider net across repositories
- General platforms (Harvard Dataverse, Figshare, Kaggle) provide diverse options with varying quality
- Specialized collections (Data is Plural, Components, Awesome Public Datasets) surface interesting datasets you might not find otherwise
Evaluate documentation quality. Well-documented datasets include codebooks, data dictionaries, and methodology descriptions. Poor documentation will cost you time and introduce errors.
See below for a list of public data collections to get started.
Essential Dataset Criteria
Completeness
You need a dataset covering all variables for your analysis. If your research question requires demographics, behavioral measures, and temporal data, the dataset must include all three.
Partial datasets work only if you can extract a complete subset. Pilot data with full variable coverage beats a larger dataset missing key variables.
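A quick way to check coverage before committing to a dataset is to compare its column names against the variables your question requires. A minimal sketch, assuming a CSV file and hypothetical variable names and path:

```r
library(readr)

# Variables the research question needs (hypothetical names)
required_vars <- c("participant_id", "age", "condition", "response_time", "session_date")

# Placeholder path to the candidate dataset
raw <- read_csv("data/raw/candidate_dataset.csv", show_col_types = FALSE)

# Any required variables the dataset is missing
setdiff(required_vars, names(raw))
```

If this returns anything, decide whether you can still answer your question without those variables before going further.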
Structural Consistency
Check for uniform column names, consistent date formats, and standardized categorical values. Inconsistencies like “Yes”/“yes”/“Y” or “01/15/2024”/“2024-01-15” create silent errors that corrupt analyses.
Inspect the first 50-100 rows for structural oddities before committing to a dataset. Structural inconsistency at the outset doesn’t need to be a dealbreaker, but it means you’ll have more work to do in data preparation stages.
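A quick first pass in R might look like the sketch below; the file path and the consent_status column are placeholders for whatever dataset you’re evaluating:

```r
library(readr)
library(dplyr)

raw <- read_csv("data/raw/candidate_dataset.csv", show_col_types = FALSE)

# Column names and types at a glance
glimpse(raw)

# Eyeball the first 100 rows for structural oddities
print(raw, n = 100)

# Check whether a categorical variable is coded consistently
# (catches things like "Yes"/"yes"/"Y" living in the same column)
raw |>
  count(consent_status)
```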
Plain Text Format
R works best with plain-text data. While packages can read Excel or SPSS files, you’ll convert them to CSV or TSV eventually. Start with formats ending in .csv, .tsv, or .txt when possible.
Most repositories offer CSV export. If only proprietary formats exist, verify conversion tools work before investing time in the dataset.
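If a dataset only comes in a proprietary format, you can usually convert it once and keep everything downstream in plain text. A minimal sketch using the readxl and haven packages (file names are placeholders):

```r
library(readxl)  # Excel files
library(haven)   # SPSS (and Stata/SAS) files
library(readr)

# Excel -> CSV
survey <- read_excel("data/raw/survey_results.xlsx", sheet = 1)
write_csv(survey, "data/raw/survey_results.csv")

# SPSS -> CSV (as_factor() converts labelled values to readable factor levels)
scales <- read_sav("data/raw/scales.sav")
write_csv(as_factor(scales), "data/raw/scales.csv")
```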
Tabular Structure
Data must organize into rows and columns. Standard tidy format (each column = variable, each row = observation) simplifies analysis, but transposed or nested structures work if convertible.
Spreadsheets typically meet this criterion. Unstructured text or XML requires preprocessing.
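As one example of a convertible structure, data stored with one column per time point can be pivoted into tidy format with tidyr. A sketch with made-up values:

```r
library(tidyr)
library(tibble)

# Wide format: one row per participant, one column per time point (made-up data)
wide <- tribble(
  ~participant_id, ~score_t1, ~score_t2, ~score_t3,
  "p01",           12,        15,        14,
  "p02",           10,        11,        16
)

# Tidy format: one row per participant per time point
long <- pivot_longer(
  wide,
  cols = starts_with("score_"),
  names_to = "timepoint",
  names_prefix = "score_",
  values_to = "score"
)
```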
Research Question Alignment
The dataset must enable answering specific research questions. On day one, articulate 2-3 concrete questions the data can address.
Students who are invested in their research projects tend to engage more with the work and generally learn more. Generic exploration of unfocused datasets produces generic insights.
Anonymization
GitHub retains complete commit history, so data you push remains recoverable even if you later delete or overwrite it. Only commit fully anonymized data. Never plan to “anonymize later.”
For human subjects data, verify no personally identifiable information exists. Random participant IDs, geographic aggregation to region level, and date jittering are standard approaches.
If you are using publicly available data from a credible source, you don’t have to worry about anonymization. However, if you are using your own data or “quasi-open” data (like data from a large research project with collaborators at multiple institutions), you’ll need to be very careful about this.
Check with advisors about discipline-specific anonymization standards.
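As a rough illustration of the approaches above (not a substitute for discipline-specific standards), random participant IDs and date jittering might look like this in R, with made-up data standing in for yours:

```r
library(dplyr)

set.seed(20240115)  # make the scrambled IDs and jitter reproducible

# Made-up raw data with identifying fields (one row per participant)
raw <- tibble::tibble(
  name         = c("Ada Lovelace", "Grace Hopper"),
  email        = c("ada@example.edu", "grace@example.edu"),
  session_date = as.Date(c("2024-01-15", "2024-01-22")),
  score        = c(42, 37)
)

anonymized <- raw |>
  mutate(
    # Replace names with randomly assigned participant IDs
    participant_id = sample(sprintf("P%03d", seq_len(n()))),
    # Jitter session dates by up to +/- 7 days
    session_date = session_date + sample(-7:7, n(), replace = TRUE)
  ) |>
  select(participant_id, session_date, score)  # drop direct identifiers

anonymized
```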
Common Pitfalls
Dataset too large. Start with a manageable size (under 100MB). You can always expand later.
Missing codebook. Variable names like “V1”, “V2”, “Q37” without explanation make analysis impossible.
Aggregated data when you need individual observations. Summary statistics can’t be disaggregated.
Licensing restrictions. Verify the dataset permits academic use and redistribution if needed.
Format mismatch. PDF tables or image-based data require extraction work that exceeds course scope.
Getting Help
If uncertain about dataset suitability, download it, examine the first few rows, and bring specific questions to Dr. Dowling or your TA. “Can this work?” questions need concrete examples, not abstract descriptions.
Public Data
- Find and use R’s built-in datasets
  - Not suitable for integrative data projects, but great for practice and core standards projects
- data.gov
- Chicago Data Portal
- Harvard’s Data Repository
- ICPSR Data Repository
- Google’s Dataset Search
- Kaggle datasets
- Components Datasets
- Roper iPoll public opinion data
- AWS Data
- Data.world open datasets
- re3data.org registry of data repositories
- Journal of Open Psychology Data
- Journal of Open Archaeology Data
- Journal of Open Public Health Data
- Open Data Network
- Figshare
- Awesome Public Datasets on Github
- Data is plural database (fun datasets of all types!)
Thank you to Dr. Jean Clipperton for compiling the majority of this list for her class on Data Visualization!