Data Management
2026-02-159 min read

Stata collapse: How to Aggregate Data with Examples

Need to go from individual-level to group-level data? collapse does it in one line. Full syntax, aggregation functions, and gotchas.

Sytra Team
Research Engineering Team, Sytra AI

You need firm-level means for a table, but one wrong collapse command can wipe your analysis dataset.

You will learn how to aggregate quickly while keeping a recoverable, auditable workflow.

All examples tested in Stata 18 SE. Compatible with Stata 15+.


Quick Answer

  1. Use `collapse` when you intentionally want fewer rows at a higher aggregation level.
  2. Wrap collapse with `preserve` and `restore` in production scripts.
  3. Name output variables explicitly to avoid confusion in downstream merges.
  4. Verify row counts after aggregation.

Aggregate Data Without Destroying Your Pipeline

Compute multi-statistic aggregates by firm and year

A robust collapse block should communicate exactly what statistic is applied to each variable. Ambiguous naming is a common replication failure point.

Use explicit aliases and grouped keys so your aggregate dataset can be merged back safely.

If you are extending this pipeline, also review Stata preserve/restore and tempvar patterns and reghdfe in Stata: Fixed Effects Tutorial.

collapse-main.do
stata
1clear all
2set obs 1200
3gen firm_id = ceil(_n/12)
4gen year = 2014 + mod(_n,10)
5gen wage = 22 + rnormal(0,3)
6gen education = 11 + floor(runiform()*8)
7
8preserve
9collapse (mean) mean_wage=wage mean_education=education (count) n_workers=wage (p50) med_wage=wage, by(firm_id year)
10
11isid firm_id year
12list firm_id year mean_wage n_workers in 1/6
13restore
. list firm_id year mean_wage n_workers in 1/6
     +-----------------------------------------+
     | firm_id   year   mean_wage   n_workers |
     |-----------------------------------------|
  1. |       1   2014    22.91841          2 |
  2. |       1   2015    20.73492          1 |
  3. |       1   2016    23.40188          1 |
     +-----------------------------------------+
๐Ÿ’กAlias outputs every time
Aliases like `mean_wage=wage` make post-collapse files self-documenting and easier to merge into final analytic datasets.

Weighted collapse for survey-style summaries

Weighted aggregation matters when records represent different population masses. Without weights, your averages can misrepresent target populations.

Always record which weight type is used and why; reviewers routinely ask for this justification.

collapse-weighted.do
stata
1clear all
2set obs 1200
3gen firm_id = ceil(_n/12)
4gen year = 2014 + mod(_n,10)
5gen wage = 22 + rnormal(0,3)
6gen education = 11 + floor(runiform()*8)
7
8preserve
9collapse (mean) mean_wage=wage mean_education=education (count) n_workers=wage (p50) med_wage=wage, by(firm_id year)
10
11isid firm_id year
12list firm_id year mean_wage n_workers in 1/6
13restore
14
15* ---- Section-specific continuation ----
16gen pop_weight = 0.5 + runiform()*3
17
18preserve
19collapse (mean) mean_wage=wage [aw=pop_weight] (sum) total_weight=pop_weight, by(year)
20
21list year mean_wage total_weight
22restore
. list year mean_wage total_weight
     +--------------------------------+
     | year   mean_wage   total_weight |
     |--------------------------------|
  1. | 2014    22.19015      223.4188 |
  2. | 2015    21.97380      221.9557 |
  3. | 2016    22.10645      224.7311 |
     +--------------------------------+
โš ๏ธWeight choice is inferential, not cosmetic
Do not switch between aweights and pweights casually. Your weighting choice changes target estimands and standard error interpretation.

Common Errors and Fixes

"varlist required"

collapse needs at least one variable-statistic pair. Calling collapse with only by() is invalid syntax.

Specify at least one statistic block such as `(mean) wage` before by().

. collapse, by(firm_id year)
varlist required
r(100);
This causes the error
wrong-way.do
stata
collapse, by(firm_id year)
This is the fix
right-way.do
stata
collapse (mean) wage education, by(firm_id year)
error-fix.do
stata
1preserve
2collapse (mean) mean_wage=wage mean_education=education, by(firm_id year)
3restore
. collapse (mean) mean_wage=wage mean_education=education, by(firm_id year)
(sum of wgt is 1,200)

. collapse (mean) mean_wage=wage mean_education=education, by(firm_id year)

Command Reference

Aggregates the dataset to group-level observations with selected statistics.

collapse (stat) newvar=oldvar [(stat) ...], by(groupvars)
(mean)Group mean
(count)Nonmissing count
(p50)Median within group
by(varlist)Grouping keys for aggregated rows

How Sytra Handles This

Sytra can generate safe collapse pipelines with preserve/restore and explicit output names, reducing accidental data overwrites.

A direct natural-language prompt for this exact workflow:

sytra-prompt.txt
bash
Collapse worker-level data to firm-year means of wage and education, count workers, keep weighted means by year, and return a merge-ready dataset with firm_id year keys.

Sytra catches these errors before you run.

Sytra can generate safe collapse pipelines with preserve/restore and explicit output names, reducing accidental data overwrites.

Join the Waitlist โ†’

FAQ

Does collapse overwrite my dataset?

Yes. collapse replaces the current data in memory, so wrap it with preserve/restore or save a temporary copy first.

Can collapse compute multiple statistics at once?

Yes. You can specify mean, count, sum, p50, and more in one command using parentheses by statistic.

How do I aggregate by more than one key?

List all grouping variables in `by()` such as `by(firm_id year)` to keep two-dimensional aggregation.


Written by Sytra Team
Research Engineering Team, Sytra AI

We build practical, reproducible workflows for Stata and R teams working on real empirical research pipelines.

#Stata#Collapse#Aggregation#Data Management

Enjoyed this article?