Stata collapse: How to Aggregate Data with Examples
Need to go from individual-level to group-level data? collapse does it in one line. Full syntax, aggregation functions, and gotchas.
You need firm-level means for a table, but one wrong collapse command can wipe your analysis dataset.
You will learn how to aggregate quickly while keeping a recoverable, auditable workflow.
All examples tested in Stata 18 SE. Compatible with Stata 15+.
Quick Answer
- Use `collapse` when you intentionally want fewer rows at a higher aggregation level.
- Wrap collapse with `preserve` and `restore` in production scripts.
- Name output variables explicitly to avoid confusion in downstream merges.
- Verify row counts after aggregation.
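The row-count check in the last bullet can be sketched as follows. This is a minimal example on simulated data; the variable names (`first_in_group`, `n_workers`) and the local macro `n_groups` are illustrative assumptions, not part of any required workflow.

```stata
* Simulated worker-level data (illustrative variable names)
clear all
set obs 1200
gen firm_id = ceil(_n/12)
gen year = 2014 + mod(_n, 10)
gen wage = 22 + rnormal(0, 3)

* Count distinct firm-year groups before aggregating
egen byte first_in_group = tag(firm_id year)
count if first_in_group
local n_groups = r(N)

preserve
collapse (mean) mean_wage=wage (count) n_workers=wage, by(firm_id year)

* One row per group, and the keys uniquely identify each row
assert _N == `n_groups'
isid firm_id year

* Worker counts should add back up to the original number of rows
quietly summarize n_workers
assert r(sum) == 1200
restore
```

If either `assert` fails, the do-file stops with an error before any downstream step can use a mis-aggregated dataset.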
Aggregate Data Without Destroying Your Pipeline
Compute multi-statistic aggregates by firm and year
A robust collapse block should communicate exactly what statistic is applied to each variable. Ambiguous naming is a common replication failure point.
Use explicit aliases and grouped keys so your aggregate dataset can be merged back safely.
If you are extending this pipeline, also review the guides "Stata preserve/restore and tempvar patterns" and "reghdfe in Stata: Fixed Effects Tutorial".
```stata
clear all
set obs 1200
gen firm_id = ceil(_n/12)
gen year = 2014 + mod(_n,10)
gen wage = 22 + rnormal(0,3)
gen education = 11 + floor(runiform()*8)

preserve
collapse (mean) mean_wage=wage mean_education=education (count) n_workers=wage (p50) med_wage=wage, by(firm_id year)

isid firm_id year
list firm_id year mean_wage n_workers in 1/6
restore
```

```
     +-----------------------------------------+
     |  firm_id   year   mean_wage   n_workers |
     |-----------------------------------------|
  1. |        1   2014    22.91841           2 |
  2. |        1   2015    20.73492           1 |
  3. |        1   2016    23.40188           1 |
     +-----------------------------------------+
```

Weighted collapse for survey-style summaries
Weighted aggregation matters when each record stands for a different number of units in the target population. Without weights, group averages can misrepresent that population.
Always record which weight type is used and why; reviewers routinely ask for this justification.
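One way to document what a weight does is to reproduce the weighted mean by hand. The sketch below assumes analytic weights (`aweight`) and checks that the mean from `collapse` matches sum(w*x) / sum(w) within each year. The variable names `wx`, `sum_wx`, `sum_w`, `manual_wmean`, and `wmean_wage` are hypothetical.

```stata
clear all
set seed 12345
set obs 1200
gen year = 2014 + mod(_n, 10)
gen wage = 22 + rnormal(0, 3)
gen pop_weight = 0.5 + runiform()*3   // simulated weight, for illustration only

* Hand-computed weighted mean per year: sum(w*x) / sum(w)
gen double wx = pop_weight * wage
bysort year: egen double sum_wx = total(wx)
bysort year: egen double sum_w  = total(pop_weight)
gen double manual_wmean = sum_wx / sum_w

preserve
* manual_wmean is constant within year, so its weighted mean is itself
collapse (mean) wmean_wage=wage manual_wmean [aw=pop_weight], by(year)

* The two versions should agree up to floating-point rounding
assert reldif(wmean_wage, manual_wmean) < 1e-6
list year wmean_wage manual_wmean
restore
```

A check like this is cheap to keep in a do-file and gives reviewers a concrete answer to what the weight is doing to the reported means.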
```stata
clear all
set obs 1200
gen firm_id = ceil(_n/12)
gen year = 2014 + mod(_n,10)
gen wage = 22 + rnormal(0,3)
gen education = 11 + floor(runiform()*8)

* Simulated analytic weight for each worker record
gen pop_weight = 0.5 + runiform()*3

preserve
* The weight clause goes after the full statistic list;
* (rawsum) totals pop_weight without re-applying the weights
collapse (mean) mean_wage=wage (rawsum) total_weight=pop_weight [aw=pop_weight], by(year)

list year mean_wage total_weight
restore
```

```
     +---------------------------------+
     | year   mean_wage   total_weight |
     |---------------------------------|
  1. | 2014    22.19015       223.4188 |
  2. | 2015    21.97380       221.9557 |
  3. | 2016    22.10645       224.7311 |
     +---------------------------------+
```

Common Errors and Fixes
"varlist required"
`collapse` needs at least one variable to aggregate; the statistic defaults to `(mean)` when omitted. Calling `collapse` with only `by()` is invalid syntax.
Specify at least one statistic block such as `(mean) wage` before by().
```
varlist required
r(100);
```

Wrong:

```stata
collapse, by(firm_id year)
```

Fixed:

```stata
collapse (mean) wage education, by(firm_id year)
```

Fixed, with explicit output names and a recoverable dataset:

```stata
preserve
collapse (mean) mean_wage=wage mean_education=education, by(firm_id year)
restore
```
Command Reference
collapse
Aggregates the dataset to group-level observations with selected statistics. See the Stata documentation for `collapse`.
- `(mean)`: Group mean
- `(count)`: Nonmissing count
- `(p50)`: Median within group
- `by(varlist)`: Grouping keys for aggregated rows
How Sytra Handles This
Sytra can generate safe collapse pipelines with preserve/restore and explicit output names, reducing accidental data overwrites.
A direct natural-language prompt for this exact workflow:
Collapse worker-level data to firm-year means of wage and education, count workers, keep weighted means by year, and return a merge-ready dataset with firm_id year keys.
Sytra catches these errors before you run.
FAQ
Does collapse overwrite my dataset?
Yes. collapse replaces the current data in memory, so wrap it with preserve/restore or save a temporary copy first.
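The "temporary copy" route can be sketched with `tempfile`. This is one possible pattern on simulated data; the macro name `workers` is an arbitrary choice.

```stata
clear all
set obs 1200
gen firm_id = ceil(_n/12)
gen year = 2014 + mod(_n, 10)
gen wage = 22 + rnormal(0, 3)

* Save the worker-level data to a temporary file before aggregating
tempfile workers
save `workers'

collapse (mean) mean_wage=wage, by(firm_id year)
* ... use or export the firm-year aggregates here ...

* Bring the original worker-level rows back into memory
use `workers', clear
assert _N == 1200
```

Unlike `preserve`, the temporary file is only reloaded when you call `use` explicitly, so this pattern suits scripts that need to keep working with the aggregated data for several steps before returning to the original rows.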
Can collapse compute multiple statistics at once?
Yes. You can request mean, count, sum, p50, and more in one command by putting each statistic in parentheses before the variables it applies to.
How do I aggregate by more than one key?
List all grouping variables in `by()`, such as `by(firm_id year)`, to get one row per firm-year combination.
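Because the collapsed data has one row per `firm_id year` pair, it can be merged back onto the worker-level rows with those keys. The following is a sketch on simulated data; the macro name `firm_year` and the variable `wage_dev` are assumptions made for illustration.

```stata
clear all
set obs 1200
gen firm_id = ceil(_n/12)
gen year = 2014 + mod(_n, 10)
gen wage = 22 + rnormal(0, 3)

* Build the firm-year aggregates and store them in a temporary file
preserve
collapse (mean) mean_wage=wage (count) n_workers=wage, by(firm_id year)
isid firm_id year
tempfile firm_year
save `firm_year'
restore

* Attach the aggregates to every worker row (many workers per firm-year)
merge m:1 firm_id year using `firm_year', assert(match) nogenerate

* Example use: each worker's deviation from the firm-year mean
gen wage_dev = wage - mean_wage
```

The `assert(match)` option makes the merge fail loudly if any worker row lacks a matching firm-year aggregate, which is a useful guard when the keys have been recoded between steps.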
Related Guides
- Stata egen Functions: Complete Reference with Examples for Every Function
- Reshape in Stata: Wide to Long and Long to Wide with Real Panel Data
- API Data in Stata: Import JSON/CSV Feeds and Build Analysis-Ready Panels
- Importing Data into Stata: Excel, CSV, Fixed-Width, SAS, and SPSS
We build practical, reproducible workflows for Stata and R teams working on real empirical research pipelines.