Data Management
2026-02-14 · 16 min read

Stata egen Functions: Complete Reference with Examples for Every Function

Every egen function in one place: mean, total, count, max, min, rowmean, rowtotal, group, tag, rank, with examples for each.

Sytra Team
Research Engineering Team, Sytra AI

You need grouped means, tags, and row totals in one script, but each analyst on your team uses different ad hoc code.

You will get a single egen playbook with reproducible patterns that scale from cleaning to estimation prep.

All examples tested in Stata 18 SE. Compatible with Stata 15+.


Quick Answer

  1. Use `egen` for grouped, row-wise, and tagging functions unavailable in plain `generate`.
  2. Pair `bysort` with egen to avoid accidental cross-group calculations.
  3. Validate derived variables with quick summaries and duplicates checks.
  4. Prefer one clear egen pass over repeated patch edits.
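Points 2 and 3 can be sketched as follows. This is a minimal illustration on simulated data, not a complete QA routine; the variable names (`firm_id`, `obs_id`, `wage`, `firm_mean_wage`) and the seed are assumptions made for the example.

```stata
* Toy data: 100 firms, 6 observations each (illustrative names)
clear
set seed 20260214
set obs 600
gen firm_id = ceil(_n/6)
gen obs_id  = _n
gen wage    = 18 + rnormal(0, 4)

* Grouped mean, with the group made explicit by bysort
bysort firm_id: egen firm_mean_wage = mean(wage)

* Check 1: quick summary of the derived variable
summarize firm_mean_wage

* Check 2: a group mean must not vary within its group
bysort firm_id (firm_mean_wage): assert firm_mean_wage[1] == firm_mean_wage[_N]

* Check 3: the row identifier should have no duplicates
duplicates report obs_id
isid obs_id
```

`assert` and `isid` stop the do-file with an error when a check fails, so a broken derived variable cannot slip into later steps unnoticed.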

Standardize Derived Variables Before Modeling

Compute grouped statistics with bysort + egen

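Grouped statistics are a frequent source of silent errors: if you omit the by-group, egen computes the statistic over the whole sample and raises no warning. The `bysort` prefix sorts the data and defines the groups in one step, so the group context is explicit on every line.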

For firm-year or school-cohort work, calculate group means once and document how they were built.

If you are extending this pipeline, also review How to Merge Datasets in Stata and Export Regression Tables in Stata: esttab Tutorial.

egen-grouped-stats.do
stata
* Simulate 100 firms with 6 observations each
clear all
set obs 600
gen firm_id = ceil(_n/6)
gen year = 2015 + mod(_n,8)
gen wage = 18 + rnormal(0,4)
gen education = 9 + floor(runiform()*9)

* Group-level statistics, repeated on every row of the group
bysort firm_id: egen firm_mean_wage = mean(wage)
bysort year: egen year_mean_education = mean(education)
bysort firm_id year: egen n_firm_year = count(wage)

summarize firm_mean_wage year_mean_education n_firm_year
. summarize firm_mean_wage year_mean_education n_firm_year
    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
firm_mean_~e |        600    18.04231    1.581243    13.9042    22.1194
year_mean_~n |        600    12.97167    .3678021    12.4211    13.4667
 n_firm_year |        600           1           0          1          1
💡 Name derived variables explicitly
Use prefixes like `firm_` or `year_` so collaborators can tell whether a variable is raw or derived without scanning your do-file.
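The same grouped pattern extends to the other functions named at the top of this article. The sketch below is illustrative only: it assumes simulated data, and names such as `firm_total_wage` and `firm_year_id` are made up for the example. It uses the `by()` option, which is interchangeable with the `bysort` prefix for these functions.

```stata
* Toy firm-year data (variable names are illustrative)
clear
set seed 20260214
set obs 600
gen firm_id = ceil(_n/6)
gen year    = 2015 + mod(_n, 8)
gen wage    = 18 + rnormal(0, 4)

* Grouped totals and extremes within firm
egen firm_total_wage = total(wage), by(firm_id)
egen firm_max_wage   = max(wage),   by(firm_id)
egen firm_min_wage   = min(wage),   by(firm_id)

* group() assigns consecutive integer IDs to each firm-year combination
egen firm_year_id = group(firm_id year)

* Without a by-group, egen uses the whole sample: one value on every row
egen overall_mean_wage = mean(wage)

summarize firm_total_wage firm_max_wage firm_min_wage firm_year_id overall_mean_wage
```

The last `egen` line is what the forgotten-grouping mistake looks like: it runs without complaint and returns a constant.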

Row-wise and tagging functions for QA workflows

Row-wise functions are useful when combining multiple survey items into indices. Tagging functions support duplicate audits and sample construction.

These functions are reliable if you keep variable lists explicit and verify edge cases like missing values.

egen-row-tag.do
stata
* Simulate 100 firms with 6 observations each
clear all
set obs 600
gen firm_id = ceil(_n/6)
gen year = 2015 + mod(_n,8)
gen wage = 18 + rnormal(0,4)
gen education = 9 + floor(runiform()*9)

* Group-level statistics, repeated on every row of the group
bysort firm_id: egen firm_mean_wage = mean(wage)
bysort year: egen year_mean_education = mean(education)
bysort firm_id year: egen n_firm_year = count(wage)

summarize firm_mean_wage year_mean_education n_firm_year

* ---- Section-specific continuation ----
* Three simulated test scores per observation
gen score_math = floor(runiform()*100)
gen score_read = floor(runiform()*100)
gen score_science = floor(runiform()*100)

* Row-wise index, one-per-firm tag, and within-year rank
egen score_total = rowtotal(score_math score_read score_science)
egen score_mean = rowmean(score_math score_read score_science)
egen firm_tag = tag(firm_id)
egen wage_rank = rank(wage), by(year)

* Fix the row order before listing so the display is deterministic
sort firm_id year
list firm_id year score_total score_mean firm_tag wage_rank in 1/3
. list firm_id year score_total score_mean firm_tag wage_rank in 1/3
     +------------------------------------------------------------------+
     | firm_id   year   score_total   score_mean   firm_tag   wage_rank |
     |------------------------------------------------------------------|
  1. |       1   2016           196     65.33333          1          43 |
  2. |       1   2017           168           56          0          51 |
  3. |       1   2018           214     71.33333          0          66 |
     +------------------------------------------------------------------+
๐Ÿ‘rowtotal handles missing differently
rowtotal skips missing values by default. If all inputs are missing, the result is 0, so verify whether that is appropriate for your design.
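The missing-value behavior is easiest to see on a three-row toy dataset. The values below are made up for illustration, and `score_total_m` and `n_items` are hypothetical names.

```stata
* Three toy survey items with deliberate gaps (values are made up)
clear
input score_math score_read score_science
80 70 90
60  . 75
 .  .  .
end

* Default: missing inputs count as 0, so the all-missing row sums to 0
egen score_total = rowtotal(score_math score_read score_science)

* missing option: the all-missing row is set to missing instead of 0
egen score_total_m = rowtotal(score_math score_read score_science), missing

* Count non-missing items so partial responses can be flagged
egen n_items = rownonmiss(score_math score_read score_science)

list
```

Here `score_total` is 240, 135, and 0, while `score_total_m` is 240, 135, and missing. `n_items` is 3, 2, and 0, which makes it straightforward to drop or flag partial responses before building an index.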

Common Errors and Fixes

"unknown egen function rowmeans()"

The function name is misspelled. egen function names must match exactly, and the row-wise functions are singular: `rowmean`, `rowtotal`, `rowmax`, not `rowmeans`.

Run `help egen` and copy function names exactly; rowmean is singular.

. egen avg_score = rowmeans(score_math score_read score_science)
unknown egen function rowmeans()
r(133);
This causes the error
wrong-way.do
stata
egen avg_score = rowmeans(score_math score_read score_science)
This is the fix
right-way.do
stata
egen avg_score = rowmean(score_math score_read score_science)
error-fix.do
stata
capture drop avg_score
egen avg_score = rowmean(score_math score_read score_science)
summarize avg_score
. summarize avg_score
    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
   avg_score |        600    49.84111    17.94028          5   95.66667

Command Reference

`egen` creates derived variables using grouped, row-wise, and specialized functions.

egen newvar = function(arguments) [, by(groupvars)]

  by()            Apply function within groups
  rowtotal()      Row-wise sum across listed variables
  tag()           Flags first observation in each group
  rank(), by()    Within-group ranking for percentile work
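As a sketch of the "percentile work" mentioned for `rank()`, the block below converts a within-year rank into a percentile. The formula `100 * (rank - 0.5) / n` is one common convention and an assumption here, not something egen prescribes; `wage_pct` and `n_year` are hypothetical names.

```stata
* Toy data: 8 years, 75 observations per year (illustrative)
clear
set seed 20260214
set obs 600
gen year = 2015 + mod(_n, 8)
gen wage = 18 + rnormal(0, 4)

* Default rank(): 1 = lowest wage within year; ties share the average rank
egen wage_rank = rank(wage), by(year)

* field option: 1 = highest wage within year
egen wage_rank_top = rank(wage), field by(year)

* Within-year percentile from the rank and the group size
egen n_year = count(wage), by(year)
gen wage_pct = 100 * (wage_rank - 0.5) / n_year

summarize wage_rank wage_rank_top wage_pct
```

Other tie-handling options (`track`, `unique`) exist as well; which one fits depends on how ties should be treated in your analysis.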

How Sytra Handles This

Sytra can translate grouped-statistics requests into exact egen patterns and flag when collapse might be more efficient.

A direct natural-language prompt for this exact workflow:

sytra-prompt.txt
text
Generate firm-level and year-level grouped summaries with egen, then build row-wise score indices and a duplicate tag variable for firm_id.

Sytra catches these errors before you run.


FAQ

What is the difference between gen and egen in Stata?

gen computes observation-level expressions, while egen adds grouped and row-wise functions such as mean by group, rowtotal, tag, and rank.
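A small sketch makes the difference concrete. The four-row dataset is made up; the point is that `sum()` inside `gen` is a running sum, whereas `total()` inside `egen` returns one value per sample or per group.

```stata
* Four-row toy dataset (values are made up)
clear
input firm_id wage
1 10
1 20
2 30
2 40
end

* gen with sum(): running total, different on every row
gen running_wage = sum(wage)

* egen with total(): one grand total repeated on every row
egen total_wage = total(wage)

* egen within a by-group: one total per firm
bysort firm_id: egen firm_total_wage = total(wage)

list, sepby(firm_id)
```

`running_wage` takes the values 10, 30, 60, and 100 (a cumulative sum in the original row order), `total_wage` is 100 on every row, and `firm_total_wage` is 30 for firm 1 and 70 for firm 2.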

Can egen be slow on large datasets?

It can be slower than specialized commands on very large panels. Use bysort with efficient grouping and avoid repeated egen calls when one pass is enough.
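One alternative to many separate egen calls is a single `collapse` followed by a merge back onto the row-level data. The sketch below assumes simulated data and hypothetical names (`firm_stats`, `firm_mean_educ`, `n_firm`); whether it is actually faster depends on your data size and number of statistics, so time both versions before switching.

```stata
* Toy data: 100 firms, 6 observations each (illustrative)
clear
set seed 20260214
set obs 600
gen firm_id   = ceil(_n/6)
gen wage      = 18 + rnormal(0, 4)
gen education = 9 + floor(runiform()*9)

* Build all firm-level statistics in one collapse pass
preserve
collapse (mean) firm_mean_wage = wage firm_mean_educ = education ///
         (count) n_firm = wage, by(firm_id)
tempfile firm_stats
save `firm_stats'
restore

* Attach the firm-level statistics back to the row-level data
merge m:1 firm_id using `firm_stats', assert(match) nogenerate

summarize firm_mean_wage firm_mean_educ n_firm
```

The `assert(match)` option makes the merge fail loudly if any firm is missing from either side, which doubles as a validation step.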

How do I compute group means without collapsing data?

Use `bysort group: egen newvar = mean(oldvar)` so your original row-level data stays intact.


Written by Sytra Team

We build practical, reproducible workflows for Stata and R teams working on real empirical research pipelines.

#Stata #egen #Data Management #Reference
