Data Management
2026-02-19 · 8 min read

Finding and Removing Duplicates in Stata: duplicates tag, report, drop

Duplicates break merges, inflate standard errors, and corrupt analysis. Here's how to detect, understand, and remove them safely.

Sytra Team
Research Engineering Team, Sytra AI

Your merge fails because keys are duplicated, and dropping rows blindly would change the sample in unknown ways.

You will learn a safe deduplication protocol that is transparent, reproducible, and defensible in appendices.

All examples tested in Stata 18 SE. Compatible with Stata 15+.


Quick Answer

  1. Run `duplicates report keyvars` to quantify duplicate burden.
  2. Tag duplicate groups with `duplicates tag` and inspect them before dropping.
  3. Apply deterministic retention logic (for example latest year or nonmissing priority).
  4. Re-run uniqueness checks with `isid` after cleanup.
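Taken together, a minimal end-to-end sketch of those four steps, assuming a panel keyed on `firm_id year` (the retention rule here, highest wage per key, is one illustrative choice):

```stata
* 1. Quantify the duplicate burden
duplicates report firm_id year

* 2. Tag and inspect before dropping anything
duplicates tag firm_id year, gen(dup)
list if dup > 0

* 3. Deterministic retention: keep the highest-wage copy per key
bysort firm_id year (wage): keep if _n == _N

* 4. Verify uniqueness after cleanup
drop dup
isid firm_id year
```

The sections below walk through each step with sample data and output.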

Deduplicate with an Audit Trail, Not Guesswork

Profile duplicate structure by analytic key

Duplicate handling should start with a clear key definition. In panel work, `firm_id year` is often the relevant uniqueness boundary.

Quantify duplicates first. Large duplicate counts can indicate upstream import or reshape problems that need structural fixes.

If you are extending this pipeline, also review "reghdfe in Stata: Fixed Effects Tutorial" and "How to Structure a Stata Project".

duplicates-profile.do
stata
clear all
input firm_id year wage education
101 2019 31 12
101 2019 31 12
101 2020 33 13
102 2019 27 10
102 2019 28 10
103 2020 35 14
end

duplicates report firm_id year
duplicates tag firm_id year, gen(dup)
list firm_id year wage education dup, sepby(firm_id year)
. duplicates report firm_id year
Duplicates in terms of firm_id year

--------------------------------------
   Copies | Observations       Surplus
----------+---------------------------
        1 |            2             0
        2 |            4             2
--------------------------------------
💡 Tag before drop
Create duplicate tags and keep them in logs so you can document exactly how many records were affected.

Apply deterministic duplicate retention rules

If duplicates differ in substantive fields like wage, dropping the first observation is arbitrary. Decide on retention logic that is aligned with data provenance.

A common rule is to keep the record with the highest nonmissing field count or the most recent update timestamp.

duplicates-resolution.do
stata
clear all
input firm_id year wage education
101 2019 31 12
101 2019 31 12
101 2020 33 13
102 2019 27 10
102 2019 28 10
103 2020 35 14
end

duplicates report firm_id year
duplicates tag firm_id year, gen(dup)
list firm_id year wage education dup, sepby(firm_id year)

* ---- Section-specific continuation ----
* Example quality score: prefer nonmissing wage and education
gen quality = !missing(wage) + !missing(education)

* Sort key in parentheses makes the kept row deterministic
bysort firm_id year (quality wage): keep if _n == _N

drop dup quality
isid firm_id year
list firm_id year wage education
. isid firm_id year
variables firm_id year uniquely identify the observations
โš ๏ธDo not use force as first step
Using `duplicates drop, force` without inspection can remove valid records and bias panel outcomes.
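When provenance favors recency rather than completeness, the same pattern works with a timestamp. A minimal sketch, assuming the raw file carries an `update_date` variable (hypothetical; not present in the sample data above):

```stata
* Keep the most recently updated record per firm-year.
* update_date is assumed to be a Stata date variable.
bysort firm_id year (update_date): keep if _n == _N
isid firm_id year
```

Because the sort key is explicit, the retained record is the same on every run, which is what makes the rule defensible in an appendix.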

Common Errors and Fixes

"variables firm_id year do not uniquely identify the observations"

You attempted a uniqueness-dependent operation while duplicates still exist in the chosen key.

Run duplicate diagnostics and resolve groups before merge or reshape commands that require unique keys.

. isid firm_id year
variables firm_id year do not uniquely identify the observations
r(459);
This causes the error
wrong-way.do
stata
isid firm_id year
This is the fix
right-way.do
stata
duplicates tag firm_id year, gen(dup)
list if dup > 0                               // inspect before dropping
bysort firm_id year (wage): keep if _n == 1   // sort key makes the kept row deterministic
isid firm_id year
error-fix.do
stata
duplicates report firm_id year
bysort firm_id year: gen seq = _n
drop if seq > 1
drop seq
isid firm_id year
. isid firm_id year
variables firm_id year uniquely identify the observations

Command Reference

Profiles and resolves duplicate observations by specified key variables.

duplicates [report | tag | list | drop] varlist [, force]
report — Summarizes duplicate copy counts
tag, gen() — Creates an indicator for duplicate-group membership
list — Displays duplicate groups for manual review
drop, force — Drops extra copies without additional checks
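Of the four subcommands, only `duplicates list` is not demonstrated above. A quick sketch of the manual-review step on the same key:

```stata
* Print every observation that belongs to a duplicate group
duplicates list firm_id year

* Equivalent view via tags, restricted to flagged rows
duplicates tag firm_id year, gen(dup)
list firm_id year wage education if dup > 0, sepby(firm_id year)
drop dup
```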

How Sytra Handles This

Sytra can propose deduplication rules from business logic and produce an audit table before any rows are removed.

A direct natural-language prompt for this exact workflow:

sytra-prompt.txt
bash
Detect duplicates by firm_id year, produce a duplicate report, generate a quality-based rule to keep one record per key, and verify uniqueness with isid.

Sytra catches these errors before you run.


Join the Waitlist →

FAQ

Should I always use duplicates drop, force?

No. Force drop can remove records with meaningful differences in non-key fields. Diagnose duplicates first and define a deterministic retention rule.

What key should I use for duplicate checks?

Use the key implied by your design, often entity-time combinations like firm_id year in panel data.

How do I keep the highest-quality record per duplicate group?

Create a quality score, sort by it within key groups, and keep the top-ranked observation deterministically.
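Concretely, reusing the variables from the examples above:

```stata
* Score each record by how many analysis fields are nonmissing
gen quality = !missing(wage) + !missing(education)

* Ascending sort puts the highest score last within each key group
bysort firm_id year (quality): keep if _n == _N
drop quality
isid firm_id year
```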


Written by Sytra Team
Research Engineering Team, Sytra AI

We build practical, reproducible workflows for Stata and R teams working on real empirical research pipelines.

#Stata · #Duplicates · #Data Cleaning · #Data Management
