Data Management
2026-02-19 · 8 min read

Finding and Removing Duplicates in Stata: duplicates tag, report, drop

Duplicates break merges, inflate standard errors, and corrupt analysis. Here's how to detect, understand, and remove them safely.

Sytra Team
Research Engineering Team, Sytra AI

Your merge fails because keys are duplicated, and dropping rows blindly would change the sample in unknown ways.

You will learn a safe deduplication protocol that is transparent, reproducible, and defensible in appendices.

All examples tested in Stata 18 SE. Compatible with Stata 15+.


Quick Answer

  1. Run `duplicates report keyvars` to quantify duplicate burden.
  2. Tag duplicate groups with `duplicates tag` and inspect them before dropping.
  3. Apply deterministic retention logic (for example latest year or nonmissing priority).
  4. Re-run uniqueness checks with `isid` after cleanup.
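Taken together, a minimal end-to-end sketch of those four steps, assuming a panel keyed on `firm_id year` (the retention rule here, highest wage per key, is one illustrative choice):

```stata
* 1. Quantify the duplicate burden
duplicates report firm_id year

* 2. Tag and inspect before dropping anything
duplicates tag firm_id year, gen(dup)
list if dup > 0

* 3. Deterministic retention: keep the highest-wage copy per key
bysort firm_id year (wage): keep if _n == _N

* 4. Verify uniqueness after cleanup
drop dup
isid firm_id year
```

The sections below walk through each step with sample data and output.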

Deduplicate with an Audit Trail, Not Guesswork

Profile duplicate structure by analytic key

Duplicate handling should start with a clear key definition. In panel work, `firm_id year` is often the relevant uniqueness boundary.

Quantify duplicates first. Large duplicate counts can indicate upstream import or reshape problems that need structural fixes.

If you are extending this pipeline, also review "reghdfe in Stata: Fixed Effects Tutorial" and "How to Structure a Stata Project".

duplicates-profile.do
stata
clear all
input firm_id year wage education
101 2019 31 12
101 2019 31 12
101 2020 33 13
102 2019 27 10
102 2019 28 10
103 2020 35 14
end

duplicates report firm_id year
duplicates tag firm_id year, gen(dup)
list firm_id year wage education dup, sepby(firm_id year)
. duplicates report firm_id year
Duplicates in terms of firm_id year

--------------------------------------
   Copies | Observations       Surplus
----------+---------------------------
        1 |            2             0
        2 |            4             2
--------------------------------------
💡 Tag before drop
Create duplicate tags and keep them in logs so you can document exactly how many records were affected.

Apply deterministic duplicate retention rules

If duplicates differ in substantive fields like wage, dropping the first observation is arbitrary. Decide on retention logic that is aligned with data provenance.

A common rule is to keep the record with the highest nonmissing field count or the most recent update timestamp.

duplicates-resolution.do
stata
clear all
input firm_id year wage education
101 2019 31 12
101 2019 31 12
101 2020 33 13
102 2019 27 10
102 2019 28 10
103 2020 35 14
end

duplicates report firm_id year
duplicates tag firm_id year, gen(dup)
list firm_id year wage education dup, sepby(firm_id year)

* ---- Section-specific continuation ----
* Example quality score: prefer nonmissing wage and education
gen quality = !missing(wage) + !missing(education)

* Sort key in parentheses makes the kept row deterministic
bysort firm_id year (quality wage): keep if _n == _N

drop dup quality
isid firm_id year
list firm_id year wage education
. isid firm_id year
variables firm_id year uniquely identify the observations
โš ๏ธDo not use force as first step
Using `duplicates drop, force` without inspection can remove valid records and bias panel outcomes.
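When provenance favors recency rather than completeness, the same pattern works with a timestamp. A minimal sketch, assuming the raw file carries an `update_date` variable (hypothetical; not present in the sample data above):

```stata
* Keep the most recently updated record per firm-year.
* update_date is assumed to be a Stata date variable.
bysort firm_id year (update_date): keep if _n == _N
isid firm_id year
```

Because the sort key is explicit, the retained record is the same on every run, which is what makes the rule defensible in an appendix.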

Common Errors and Fixes

"variables firm_id year do not uniquely identify the observations"

You attempted a uniqueness-dependent operation while duplicates still exist in the chosen key.

Run duplicate diagnostics and resolve groups before merge or reshape commands that require unique keys.

. isid firm_id year
variables firm_id year do not uniquely identify the observations
r(459);
This causes the error
wrong-way.do
stata
isid firm_id year
This is the fix
right-way.do
stata
duplicates tag firm_id year, gen(dup)
list if dup > 0                               // inspect before dropping
bysort firm_id year (wage): keep if _n == 1   // sort key makes the kept row deterministic
isid firm_id year
error-fix.do
stata
duplicates report firm_id year
bysort firm_id year: gen seq = _n
drop if seq > 1
drop seq
isid firm_id year
. isid firm_id year
variables firm_id year uniquely identify the observations

Command Reference

Profiles and resolves duplicate observations by specified key variables.

duplicates [report | tag | list | drop] varlist [, force]
report — Summarizes duplicate copy counts
tag, gen() — Creates an indicator for duplicate-group membership
list — Displays duplicate groups for manual review
drop, force — Drops extra copies without additional checks
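Of the four subcommands, only `duplicates list` is not demonstrated above. A quick sketch of the manual-review step on the same key:

```stata
* Print every observation that belongs to a duplicate group
duplicates list firm_id year

* Equivalent view via tags, restricted to flagged rows
duplicates tag firm_id year, gen(dup)
list firm_id year wage education if dup > 0, sepby(firm_id year)
drop dup
```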

How Sytra Handles This

Sytra can propose deduplication rules from business logic and produce an audit table before any rows are removed.

A direct natural-language prompt for this exact workflow:

sytra-prompt.txt
bash
Detect duplicates by firm_id year, produce a duplicate report, generate a quality-based rule to keep one record per key, and verify uniqueness with isid.

Sytra catches these errors before you run.


Join the Waitlist →

FAQ

Should I always use duplicates drop, force?

No. Force drop can remove records with meaningful differences in non-key fields. Diagnose duplicates first and define a deterministic retention rule.

What key should I use for duplicate checks?

Use the key implied by your design, often entity-time combinations like firm_id year in panel data.

How do I keep the highest-quality record per duplicate group?

Create a quality score, sort by it within key groups, and keep the top-ranked observation deterministically.
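Concretely, reusing the variables from the examples above:

```stata
* Score each record by how many analysis fields are nonmissing
gen quality = !missing(wage) + !missing(education)

* Ascending sort puts the highest score last within each key group
bysort firm_id year (quality): keep if _n == _N
drop quality
isid firm_id year
```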


Written by Sytra Team
Research Engineering Team, Sytra AI

We build practical, reproducible workflows for Stata and R teams working on real empirical research pipelines.

#Stata · #Duplicates · #Data Cleaning · #Data Management
