Data Management
2026-02-1115 min read

How to Merge Datasets in Stata: 1:1, m:1, 1:m with Complete Examples

The definitive guide to merging in Stata. Covers every merge type, _merge diagnostics, keepusing, common errors, and when to use joinby instead.

Sytra Team
Research Engineering Team, Sytra AI

You have been staring at a merge error for 20 minutes, and every rerun gives a different number of matched rows.

By the end, you will know exactly which merge type to use, how to validate it, and how to recover from bad merges safely.

All examples tested in Stata 18 SE. Compatible with Stata 15+.


Quick Answer

  1. Confirm the merge key with `isid` on the side that must be unique.
  2. Run the right merge type (`1:1`, `m:1`, or `1:m`) and keep `_merge`.
  3. Audit unmatched observations before dropping anything.
  4. Save a validated merged dataset only after diagnostics pass.

Build a Merge Workflow That Survives Peer Review

Create stable keys and run the correct merge type

Most merge failures start with unstable keys. In firm-year data, key drift usually comes from string IDs, missing years, or duplicate observations that went unnoticed during cleaning.

Start by building numeric keys, checking uniqueness, and only then run the merge. This pattern avoids almost every downstream merge bug that appears later in regressions.

If you are extending this pipeline, also review reghdfe in Stata: High-Dimensional Fixed Effects Made Simple and How to Structure a Stata Project.

merge-main.do
stata
1clear all
2set seed 260211
3
4* Master dataset: worker-year wages
5set obs 1200
6gen firm_id = ceil(_n/12)
7gen year = 2014 + mod(_n, 10)
8gen worker_id = _n
9gen education = 10 + floor(runiform()*8)
10gen wage = 14 + 0.8*education + 0.2*(year-2014) + rnormal(0,2)
11tempfile master using
12save `master'
13
14* Using dataset: firm-year covariates
15clear
16set obs 1000
17gen firm_id = ceil(_n/10)
18gen year = 2014 + mod(_n, 10)
19gen industry = mod(firm_id, 6) + 1
20gen firm_size = 40 + floor(runiform()*300)
21isid firm_id year
22save `using'
23
24use `master', clear
25merge m:1 firm_id year using `using'
26tab _merge
. tab _merge
     Result from merge |      Freq.     Percent        Cum.
-----------------------+-----------------------------------
Master only (1)        |        192       16.00       16.00
Using only (2)         |          0        0.00       16.00
Matched (3)            |      1,008       84.00      100.00
-----------------------+-----------------------------------
Total                  |      1,200      100.00
๐Ÿ’กAudit before dropping
Do not drop `_merge==1` until you can explain why rows are unmatched. Those rows often reveal data vendor gaps or coding errors.

Diagnose duplicates and key conflicts before estimation

If keys are duplicated on a side that must be unique, Stata stops with a hard error. This is useful: it protects your design from accidental many-to-many merges.

When duplicates are real, aggregate first or switch design logic. Never force a merge and hope model fixed effects will absorb the data issue.

merge-diagnostics.do
stata
1clear all
2set seed 260211
3
4* Master dataset: worker-year wages
5set obs 1200
6gen firm_id = ceil(_n/12)
7gen year = 2014 + mod(_n, 10)
8gen worker_id = _n
9gen education = 10 + floor(runiform()*8)
10gen wage = 14 + 0.8*education + 0.2*(year-2014) + rnormal(0,2)
11tempfile master using
12save `master'
13
14* Using dataset: firm-year covariates
15clear
16set obs 1000
17gen firm_id = ceil(_n/10)
18gen year = 2014 + mod(_n, 10)
19gen industry = mod(firm_id, 6) + 1
20gen firm_size = 40 + floor(runiform()*300)
21isid firm_id year
22save `using'
23
24use `master', clear
25merge m:1 firm_id year using `using'
26tab _merge
27
28* ---- Section-specific continuation ----
29use `using', clear
30
31* Check uniqueness directly
32isid firm_id year
33
34* If the command above fails, diagnose duplicates
35duplicates report firm_id year
36duplicates tag firm_id year, gen(dup)
37list firm_id year if dup > 0, sepby(firm_id year)
38
39* Safe cleanup example
40bysort firm_id year: egen avg_firm_size = mean(firm_size)
41by firm_id year: keep if _n == 1
42replace firm_size = avg_firm_size
43drop avg_firm_size dup
44isid firm_id year
. duplicates report firm_id year
Duplicates in terms of firm_id year

--------------------------------------
   Copies | Observations       Surplus
----------+---------------------------
        1 |          968             0
        2 |           32            16
--------------------------------------
โš ๏ธMany-to-many is usually design failure
If you think you need `merge m:m`, pause. In applied research, it usually indicates a key definition problem, not a valid design choice.

Common Errors and Fixes

"variables firm_id year do not uniquely identify observations in the using data"

This happens when your using dataset has duplicate key combinations while your command expects uniqueness on that side.

Run `duplicates report firm_id year` in the using dataset and decide whether to aggregate or redefine the merge key.

. use worker_panel.dta, clear
variables firm_id year do not uniquely identify observations in the using data
r(459);
This causes the error
wrong-way.do
stata
use worker_panel.dta, clear
merge m:1 firm_id year using firm_covariates.dta
This is the fix
right-way.do
stata
use firm_covariates.dta, clear
bysort firm_id year: egen firm_size_avg = mean(firm_size)
by firm_id year: keep if _n==1
replace firm_size = firm_size_avg
drop firm_size_avg
save firm_covariates_clean.dta, replace
use worker_panel.dta, clear
merge m:1 firm_id year using firm_covariates_clean.dta
error-fix.do
stata
1use worker_panel.dta, clear
2merge m:1 firm_id year using firm_covariates_clean.dta
3tab _merge
4assert _merge != 2
. assert _merge != 2
(0 real changes made)

. assert _merge != 2

Command Reference

Combines master and using datasets by key variables while generating merge diagnostics via `_merge`.

merge [1:1 | 1:m | m:1] keyvars using filename [, keepusing(varlist) nogen assert()]
keepusing(varlist)Pull only required variables from using dataset
nogenSuppresses _merge generation when diagnostics are already captured
assert(match)Fails fast when unmatched observations appear
update replaceControlled overwrites when merging revised fields

How Sytra Handles This

Sytra can run key uniqueness checks, type checks, and merge diagnostics before writing the final merge command, reducing silent data loss.

A direct natural-language prompt for this exact workflow:

sytra-prompt.txt
bash
Merge worker_panel.dta with firm_covariates.dta on firm_id and year using m:1. Before merge, verify key uniqueness in using, report duplicates, aggregate duplicates by mean firm_size, then rerun merge and output a table of _merge counts.

Sytra catches these errors before you run.

Sytra can run key uniqueness checks, type checks, and merge diagnostics before writing the final merge command, reducing silent data loss.

Join the Waitlist โ†’

FAQ

What merge type should I use in Stata?

Use merge 1:1 when both datasets are unique on the key, merge m:1 when master has repeated keys and using is unique, and merge 1:m for the opposite case.

Do I need to sort before merge 1:1?

No. Modern merge syntax in Stata 11+ does not require manual sorting, but key uniqueness checks are still required for valid results.

How do I verify merge quality quickly?

Always tabulate _merge, inspect unmatched rows, and run isid on keys before merging so you catch duplicates before they damage estimates.


Written by Sytra Team
Research Engineering Team, Sytra AI

We build practical, reproducible workflows for Stata and R teams working on real empirical research pipelines.

#Stata#Merge#Data Management

Enjoyed this article?