Finding and Removing Duplicates in Stata: duplicates tag, report, drop
Duplicates break merges, inflate standard errors, and corrupt analysis. Here's how to detect, understand, and remove them safely.
Your merge fails because keys are duplicated, and dropping rows blindly would change the sample in unknown ways.
You will learn a safe deduplication protocol that is transparent, reproducible, and defensible in appendices.
All examples tested in Stata 18 SE. Compatible with Stata 15+.
Quick Answer
- Run `duplicates report keyvars` to quantify duplicate burden.
- Tag duplicate groups with `duplicates tag` and inspect them before dropping.
- Apply deterministic retention logic (for example latest year or nonmissing priority).
- Re-run uniqueness checks with `isid` after cleanup.
Deduplicate with an Audit Trail, Not Guesswork
Profile duplicate structure by analytic key
Duplicate handling should start with a clear key definition. In panel work, `firm_id year` is often the relevant uniqueness boundary.
Quantify duplicates before touching any rows. A large duplicate count often signals an upstream import or reshape problem that needs a structural fix, not row deletion.
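The quantification step can be wrapped in a small guard so a pipeline fails fast when duplication exceeds an acceptable share. A minimal sketch, assuming the `firm_id year` key from this guide; the 5% threshold and the `_dup` variable name are illustrative choices, not part of the example data:

```stata
* Sketch: abort the do-file if more than 5% of rows sit in duplicate groups.
* Threshold (0.05) and variable name _dup are illustrative assumptions.
duplicates tag firm_id year, gen(_dup)
quietly count if _dup > 0
display as text "rows in duplicate groups: " r(N) " of " _N
assert r(N) / _N < 0.05
drop _dup
```

The `assert` makes the failure loud and logged, which is easier to defend in an appendix than a silent drop.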
If you are extending this pipeline, also review reghdfe in Stata: Fixed Effects Tutorial and How to Structure a Stata Project.
```stata
clear all
input firm_id year wage education
101 2019 31 12
101 2019 31 12
101 2020 33 13
102 2019 27 10
102 2019 28 10
103 2020 35 14
end

duplicates report firm_id year
duplicates tag firm_id year, gen(dup)
list firm_id year wage education dup, sepby(firm_id year)
```

```
Duplicates in terms of firm_id year

--------------------------------------
   Copies | Observations       Surplus
----------+---------------------------
        1 |            2             0
        2 |            4             2
--------------------------------------
```

Apply deterministic duplicate retention rules
If duplicates differ in substantive fields like wage, dropping the first observation is arbitrary. Choose retention logic that matches the data's provenance.
A common rule is to keep the record with the highest count of nonmissing fields, or the one with the most recent update timestamp.
```stata
clear all
input firm_id year wage education
101 2019 31 12
101 2019 31 12
101 2020 33 13
102 2019 27 10
102 2019 28 10
103 2020 35 14
end

duplicates report firm_id year
duplicates tag firm_id year, gen(dup)
list firm_id year wage education dup, sepby(firm_id year)

* ---- Section-specific continuation ----
* Example quality score: prefer nonmissing wage and education
gen quality = !missing(wage) + !missing(education)

bysort firm_id year (quality wage): keep if _n == _N

drop dup quality
isid firm_id year
list firm_id year wage education
```

```
. isid firm_id year
```

`isid` exits silently when the key uniquely identifies the observations; any surviving duplicates would instead raise error r(459).
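If the source system supplies a last-modified field, the same `bysort` pattern selects the most recent record instead of the highest-quality one. A sketch under that assumption; `update_date` is a hypothetical variable, not part of the example data above:

```stata
* Hypothetical variant: keep the most recently updated record per key.
* Assumes an update_date variable (e.g., a Stata %td date) exists in the data.
bysort firm_id year (update_date): keep if _n == _N
isid firm_id year
```

Because the sort order is explicit, the rule is deterministic and can be stated verbatim in a data appendix.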
Common Errors and Fixes
"variables firm_id year do not uniquely identify the observations"
You attempted a uniqueness-dependent operation while duplicates still exist in the chosen key.
Run duplicate diagnostics and resolve groups before merge or reshape commands that require unique keys.
```
. isid firm_id year
variables firm_id year do not uniquely identify the observations
r(459);
```

Diagnose the duplicate groups, resolve them, then re-check:

```stata
duplicates tag firm_id year, gen(dup)
list if dup>0
bysort firm_id year: keep if _n==1
isid firm_id year
```

Note that `keep if _n==1` retains an arbitrary copy when records differ; prefer a deterministic retention rule like the quality score shown earlier. An equivalent fix using an explicit sequence counter:

```stata
duplicates report firm_id year
bysort firm_id year: gen seq = _n
drop if seq>1
drop seq
isid firm_id year
```

After cleanup, `isid firm_id year` completes silently, confirming the key is unique.
Command Reference
`duplicates`
Profiles and resolves duplicate observations by specified key variables; see the Stata documentation for full syntax.

| Subcommand | Purpose |
|---|---|
| `report` | Summarizes duplicate copy counts |
| `tag, gen()` | Creates an indicator for duplicate-group membership |
| `list` | Displays duplicate groups for manual review |
| `drop, force` | Drops extra copies without additional checks |

How Sytra Handles This
Sytra can propose deduplication rules from business logic and produce an audit table before any rows are removed.
A direct natural-language prompt for this exact workflow:

> Detect duplicates by firm_id year, produce a duplicate report, generate a quality-based rule to keep one record per key, and verify uniqueness with isid.

Sytra catches these errors before you run.
FAQ
Should I always use duplicates drop, force?
No. Force drop can remove records with meaningful differences in non-key fields. Diagnose duplicates first and define a deterministic retention rule.
What key should I use for duplicate checks?
Use the key implied by your design, often entity-time combinations like firm_id year in panel data.
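To test a candidate key without halting the do-file, `isid` can be wrapped in `capture`. A small sketch using this guide's example key:

```stata
* Non-fatal uniqueness check for a candidate key
capture isid firm_id year
if _rc != 0 {
    display as error "firm_id year does not uniquely identify observations"
}
```

This pattern is useful when probing several candidate keys in sequence before committing to one.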
How do I keep the highest-quality record per duplicate group?
Create a quality score, sort by it within key groups, and keep the top-ranked observation deterministically.
Related Guides
- How to Merge Datasets in Stata: 1:1, m:1, 1:m with Complete Examples
- Stata 'not sorted' Error in Merge: The Fix That Takes 5 Seconds
- Stata egen Functions: Complete Reference with Examples for Every Function
- Stata Type Mismatch Error in Merge: String vs Numeric Key Variables
- Explore the data management pillar page
- Open the full data management guide index
- Browse all Stata & R guides on the blog index
- Browse all Stata pillars
We build practical, reproducible workflows for Stata and R teams working on real empirical research pipelines.