Is Sytra free for researchers?

Yes. Sytra is free forever for individual researchers. You bring your own API key from OpenAI or Anthropic and pay only for the AI inference costs (typically $0.01-0.10 per query).

Does Sytra upload my data to the cloud?

No. Sytra runs entirely on your local machine. Your .dta files, .csv files, and code never leave your computer. Only the natural language prompt is sent to the AI provider.

What versions of Stata does Sytra support?

Sytra supports Stata 17 and later, including MP, SE, and BE editions.

Data Management

2026-02-159 min read

Stata collapse: How to Aggregate Data with Examples

Need to go from individual-level to group-level data? collapse does it in one line. Full syntax, aggregation functions, and gotchas.

Sytra Team

Research Engineering Team, Sytra AI

You need firm-level means for a table, but one wrong collapse command can wipe your analysis dataset.

You will learn how to aggregate quickly while keeping a recoverable, auditable workflow.

All examples tested in Stata 18 SE. Compatible with Stata 15+.

Quick Answer

Use `collapse` when you intentionally want fewer rows at a higher aggregation level.
Wrap collapse with `preserve` and `restore` in production scripts.
Name output variables explicitly to avoid confusion in downstream merges.
Verify row counts after aggregation.

Aggregate Data Without Destroying Your Pipeline

Compute multi-statistic aggregates by firm and year

A robust collapse block should communicate exactly what statistic is applied to each variable. Ambiguous naming is a common replication failure point.

Use explicit aliases and grouped keys so your aggregate dataset can be merged back safely.

If you are extending this pipeline, also review Stata preserve/restore and tempvar patterns and reghdfe in Stata: Fixed Effects Tutorial.

collapse-main.do

stata

1clear all
2set obs 1200
3gen firm_id = ceil(_n/12)
4gen year = 2014 + mod(_n,10)
5gen wage = 22 + rnormal(0,3)
6gen education = 11 + floor(runiform()*8)
7
8preserve
9collapse (mean) mean_wage=wage mean_education=education          (count) n_workers=wage          (p50) med_wage=wage, by(firm_id year)
10
11isid firm_id year
12list firm_id year mean_wage n_workers in 1/6
13restore

. list firm_id year mean_wage n_workers in 1/6

     +-----------------------------------------+
     | firm_id   year   mean_wage   n_workers |
     |-----------------------------------------|
  1. |       1   2014    22.91841          2 |
  2. |       1   2015    20.73492          1 |
  3. |       1   2016    23.40188          1 |
     +-----------------------------------------+

💡Alias outputs every time

Aliases like `mean_wage=wage` make post-collapse files self-documenting and easier to merge into final analytic datasets.

Weighted collapse for survey-style summaries

Weighted aggregation matters when records represent different population masses. Without weights, your averages can misrepresent target populations.

Always record which weight type is used and why; reviewers routinely ask for this justification.

collapse-weighted.do

stata

1clear all
2set obs 1200
3gen firm_id = ceil(_n/12)
4gen year = 2014 + mod(_n,10)
5gen wage = 22 + rnormal(0,3)
6gen education = 11 + floor(runiform()*8)
7
8preserve
9collapse (mean) mean_wage=wage mean_education=education          (count) n_workers=wage          (p50) med_wage=wage, by(firm_id year)
10
11isid firm_id year
12list firm_id year mean_wage n_workers in 1/6
13restore
14
15* ---- Section-specific continuation ----
16gen pop_weight = 0.5 + runiform()*3
17
18preserve
19collapse (mean) mean_wage=wage [aw=pop_weight]          (sum) total_weight=pop_weight, by(year)
20
21list year mean_wage total_weight
22restore

. list year mean_wage total_weight

     +--------------------------------+
     | year   mean_wage   total_weight |
     |--------------------------------|
  1. | 2014    22.19015      223.4188 |
  2. | 2015    21.97380      221.9557 |
  3. | 2016    22.10645      224.7311 |
     +--------------------------------+

⚠️Weight choice is inferential, not cosmetic

Do not switch between aweights and pweights casually. Your weighting choice changes target estimands and standard error interpretation.

Common Errors and Fixes

"varlist required"

collapse needs at least one variable-statistic pair. Calling collapse with only by() is invalid syntax.

Specify at least one statistic block such as `(mean) wage` before by().

. collapse, by(firm_id year)

varlist required
r(100);

This causes the error

wrong-way.do

stata

collapse, by(firm_id year)

This is the fix

right-way.do

stata

collapse (mean) wage education, by(firm_id year)

error-fix.do

stata

1preserve
2collapse (mean) mean_wage=wage mean_education=education, by(firm_id year)
3restore

. collapse (mean) mean_wage=wage mean_education=education, by(firm_id year)

(sum of wgt is 1,200)

. collapse (mean) mean_wage=wage mean_education=education, by(firm_id year)

Command Reference

collapse

Stata docs →

Aggregates the dataset to group-level observations with selected statistics.

collapse (stat) newvar=oldvar [(stat) ...], by(groupvars)

(mean)Group mean

(count)Nonmissing count

(p50)Median within group

by(varlist)Grouping keys for aggregated rows

How Sytra Handles This

Sytra can generate safe collapse pipelines with preserve/restore and explicit output names, reducing accidental data overwrites.

A direct natural-language prompt for this exact workflow:

sytra-prompt.txt

bash

Collapse worker-level data to firm-year means of wage and education, count workers, keep weighted means by year, and return a merge-ready dataset with firm_id year keys.

Sytra catches these errors before you run.

Sytra can generate safe collapse pipelines with preserve/restore and explicit output names, reducing accidental data overwrites.

Join the Waitlist →

FAQ

Does collapse overwrite my dataset?

Yes. collapse replaces the current data in memory, so wrap it with preserve/restore or save a temporary copy first.

Can collapse compute multiple statistics at once?

Yes. You can specify mean, count, sum, p50, and more in one command using parentheses by statistic.

How do I aggregate by more than one key?

List all grouping variables in `by()` such as `by(firm_id year)` to keep two-dimensional aggregation.

Written by Sytra Team

Research Engineering Team, Sytra AI

We build practical, reproducible workflows for Stata and R teams working on real empirical research pipelines.

#Stata#Collapse#Aggregation#Data Management

Stata collapse: How to Aggregate Data with Examples

Quick Answer

Aggregate Data Without Destroying Your Pipeline

Compute multi-statistic aggregates by firm and year

Weighted collapse for survey-style summaries

Common Errors and Fixes

"varlist required"

Command Reference

collapse

How Sytra Handles This

Sytra catches these errors before you run.

FAQ

Does collapse overwrite my dataset?

Can collapse compute multiple statistics at once?

How do I aggregate by more than one key?

Enjoyed this article?

Related Guides

Stata egen Functions: Complete Reference with Examples for Every Function

Reshape in Stata: Wide to Long and Long to Wide with Real Panel Data

How to Merge Datasets in Stata: 1:1, m:1, 1:m with Complete Examples

Stata Loops: foreach and forvalues Tutorial with 20 Practical Examples

import delimited in Stata: CSV Imports Without Type Breaks