Instrumental Variables in Stata: When and How
A guide to IV estimation in Stata — when to use instruments, how to test for weak instruments, and why ChatGPT gets the syntax wrong.
Instrumental variables (IV) estimation is one of the most powerful tools in the econometrician’s arsenal, and one of the most frequently misapplied. A good instrument can solve endogeneity problems that no amount of controls can fix. A bad instrument can produce estimates that are worse than OLS. The difference between the two is not always obvious, which is why the diagnostics matter as much as the estimation.
When You Need IV
You need instrumental variables when your key explanatory variable is correlated with the error term — i.e., when ordinary least squares is biased. This happens because of:
- Omitted variable bias: There’s an unobserved confounder that affects both your X and your Y.
- Reverse causality: Y causes X, not just the other way around.
- Measurement error: X is measured with noise, and that noise attenuates the coefficient.
An instrument Z must satisfy two conditions: (1) relevance — Z is correlated with X, and (2) exclusion — Z affects Y only through X. The first can be tested. The second cannot — it’s an untestable assumption that must be argued on theoretical grounds.
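The two conditions can be made concrete with simulated data. The sketch below is an illustration under assumed parameter values, not an empirical result: an unobserved confounder u drives both x and y, and an instrument z shifts x but has no direct path to y. All variable names and coefficients are made up for the example.

```stata
* Simulated data: every number here is an illustrative assumption
clear
set seed 20240101
set obs 5000

generate u = rnormal()                     // unobserved confounder
generate z = rnormal()                     // instrument: moves x, independent of u
generate x = 0.8*z + u + rnormal()         // relevance: x depends on z (and on u)
generate y = 1 + 2*x + 1.5*u + rnormal()   // exclusion: z enters y only through x

* OLS omits u, so x is correlated with the error term
regress y x, vce(robust)

* 2SLS uses only the variation in x that is induced by z
ivregress 2sls y (x = z), vce(robust)
```

By construction the true coefficient on x is 2. In a large sample the OLS estimate should settle at roughly 2.6 (2 plus a bias of 1.5/2.64), while the 2SLS estimate should sit close to 2.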
Basic 2SLS in Stata
The syntax is ivregress 2sls depvar controls (endogenous = instruments), options. The endogenous variable x goes inside the parentheses, with the instruments z1 z2 after the equals sign. Exogenous controls go before the parentheses. Always use vce(robust) or vce(cluster clustvar): the default standard errors assume homoskedastic errors, which is rarely plausible in practice.
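A minimal sketch of that syntax follows. It assumes a dataset is already in memory with an outcome y, the endogenous regressor x, and the instruments z1 and z2. The controls w1 and w2 and the cluster identifier id are hypothetical names added for illustration.

```stata
* Placeholder names: y (outcome), x (endogenous), z1 z2 (instruments),
* w1 w2 (exogenous controls), id (cluster identifier). None come from a real dataset.

* 2SLS with heteroskedasticity-robust standard errors
ivregress 2sls y w1 w2 (x = z1 z2), vce(robust)

* Same model with standard errors clustered on id
ivregress 2sls y w1 w2 (x = z1 z2), vce(cluster id)
```

Stata automatically includes the controls listed outside the parentheses in the first stage, so they do not need to be repeated after the equals sign.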
ChatGPT commonly generates:
The First Stage: Testing Instrument Strength
This is the single most important diagnostic in IV estimation. A weak instrument — one that is only weakly correlated with the endogenous variable — produces IV estimates that are biased toward OLS, have enormous standard errors, and can be wildly misleading.
The familiar rule of thumb, which goes back to Staiger and Stock (1997), is that the first-stage F-statistic should be at least 10. Below that, your instruments are weak and your IV estimates are unreliable. Stock and Yogo (2005) show that the appropriate threshold depends on the number of instruments and the maximum bias you are willing to tolerate, and Andrews, Stock, and Sun (2019) caution that even F > 10 isn’t always sufficient, particularly when errors are heteroskedastic or clustered.
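The check itself is one command, estat firststage, run right after ivregress. The sketch below uses simulated data with assumed coefficients to contrast a strong and a weak first stage. The variable names (x_strong, x_weak, and so on) are made up for the example.

```stata
* Simulated data: coefficients are illustrative assumptions
clear
set seed 4321
set obs 2000

generate u  = rnormal()                                 // unobserved confounder
generate z1 = rnormal()
generate z2 = rnormal()
generate x_strong = 0.50*z1 + 0.50*z2 + u + rnormal()   // instruments matter a lot
generate x_weak   = 0.02*z1 + 0.02*z2 + u + rnormal()   // instruments barely matter
generate y_strong = 1 + 2*x_strong + 1.5*u + rnormal()
generate y_weak   = 1 + 2*x_weak   + 1.5*u + rnormal()

* Strong first stage: the F-statistic should be far above 10
ivregress 2sls y_strong (x_strong = z1 z2), vce(robust)
estat firststage

* Weak first stage: the F-statistic should be small, and the 2SLS
* coefficient and standard error should not be trusted
ivregress 2sls y_weak (x_weak = z1 z2), vce(robust)
estat firststage
```

estat firststage also prints the minimum eigenvalue statistic alongside the Stock-Yogo critical values, which is the relevant comparison when there are several instruments.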
ChatGPT never generates this check. In every test we ran, ChatGPT produced the ivregress command and stopped. No first-stage diagnostics, no weak instrument test. This is like running a t-test without looking at your sample size — technically you get a number, but it might mean nothing.
Overidentification: The Hansen/Sargan Test
If you have more instruments than endogenous variables (overidentification), you can test whether the extra instruments are valid. The logic: if all instruments are truly exogenous, they should all give you the same answer. If they don’t, at least one is invalid.
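A significant test statistic is evidence that at least one of your instruments fails the exclusion restriction, although the test cannot tell you which one. This is bad news: it means your IV estimates may not be consistent. A non-rejection is not proof of validity either, because the test assumes that at least as many instruments as endogenous variables are valid.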
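Important nuance: which statistic estat overid reports depends on how you estimated the model. After ivregress 2sls with the default standard errors, Stata reports the Sargan and Basmann tests, which assume homoskedasticity. With vce(robust), it reports Wooldridge’s robust score test instead. Hansen’s J is reported after ivregress gmm (and by the community-contributed ivreg2 command when robust is specified). All of these test the same null hypothesis, and a heteroskedasticity-robust version is almost always what you want.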
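The test can be tried out on simulated data in which one instrument is deliberately invalid. This is a sketch under assumed parameter values: z1 is a valid instrument, while z2 is given a direct effect on y, which violates the exclusion restriction. All names and coefficients are illustrative.

```stata
* Simulated data: coefficients are illustrative assumptions
clear
set seed 777
set obs 5000

generate u  = rnormal()                               // unobserved confounder
generate z1 = rnormal()                               // valid instrument
generate z2 = rnormal()                               // invalid: enters y directly below
generate x  = 0.6*z1 + 0.6*z2 + u + rnormal()
generate y  = 1 + 2*x + 0.5*z2 + 1.5*u + rnormal()

* After 2SLS with a robust VCE, estat overid reports a robust score test
ivregress 2sls y (x = z1 z2), vce(robust)
estat overid

* After GMM with a robust weight matrix, estat overid reports Hansen's J
ivregress gmm y (x = z1 z2), wmatrix(robust)
estat overid
```

With a direct effect this large and 5,000 observations, both tests should reject. Setting the 0.5 coefficient on z2 to zero and rerunning gives the case where both instruments are valid.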
The Full IV Pipeline
Here’s what a complete IV estimation in Stata actually looks like — the version ChatGPT never generates:
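A minimal sketch of such a pipeline, built from the commands discussed in this guide, might look like the following. It assumes a dataset in memory with the outcome y, the endogenous regressor x, and instruments z1 and z2. The controls w1 and w2 are hypothetical names, and the choice of an OLS baseline as the first step is an assumption about what a complete workflow includes.

```stata
* Placeholder variables assumed to be in memory: y, x, z1, z2, w1, w2

* 1. OLS baseline, for comparison with the IV estimate
regress y x w1 w2, vce(robust)

* 2. 2SLS estimate
ivregress 2sls y w1 w2 (x = z1 z2), vce(robust)

* 3. Instrument strength (first-stage F and related statistics)
estat firststage

* 4. Overidentifying restrictions (requires more instruments than endogenous regressors)
estat overid

* 5. Is x actually endogenous?
estat endogenous
```

The three estat commands must come after ivregress, because they operate on the most recent estimation results.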
Five commands. ChatGPT gives you one of them. The other four are where the actual science happens.
Common Mistakes
- Too many instruments: With many instruments, the first-stage F can be large but the instruments can still be collectively weak. Use the Cragg-Donald statistic and Stock-Yogo critical values, not just the first-stage F from a single equation.
- Ignoring endogeneity: IV is only better than OLS if the endogeneity is real. The Durbin-Wu-Hausman test (estat endogenous) checks this. If the test is insignificant, stick with OLS, which is more efficient.
- Using ivreg instead of ivregress: The old ivreg command is deprecated. Always use ivregress, which supports vce() options and modern diagnostics.
- Reporting only the second stage: Journals expect to see first-stage results. Use estimates table or esttab (from the community-contributed estout package) to report both stages.
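One way to report both stages with built-in commands is to run the first stage by hand, store each set of results, and tabulate them side by side. This is a sketch rather than a required workflow. It assumes the same placeholder variables as in the syntax description (y, x, z1, z2), plus hypothetical controls w1 and w2. The stored names first and second are arbitrary.

```stata
* Placeholder variables assumed to be in memory: y, x, z1, z2, w1, w2

* First stage by hand: endogenous regressor on instruments AND all exogenous controls
regress x z1 z2 w1 w2, vce(robust)
estimates store first

* Second stage via 2SLS (do not run the second stage by hand: its standard errors would be wrong)
ivregress 2sls y w1 w2 (x = z1 z2), vce(robust)
estimates store second

* Side-by-side table of coefficients and standard errors
estimates table first second, b(%9.3f) se(%9.3f) stats(N)
```

For a quick look without building a table, adding the first option to ivregress prints the first-stage regression above the 2SLS results.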
How Sytra Handles IV
When you tell Sytra “estimate the effect of education on income using distance to college as an instrument,” it generates the full pipeline: ivregress with proper syntax, estat firststage, estat overid, and estat endogenous. If the first-stage F is below 10, Sytra flags it with a warning. If the overidentification test rejects, it suggests revisiting your instrument set.
The AI doesn’t just write the regression command. It runs the entire inferential chain — because that’s what valid IV estimation requires.
Further Reading
- Angrist, J. D., & Pischke, J. S. (2009). Mostly Harmless Econometrics. Princeton University Press. Chapter 4.
- Stock, J. H., & Yogo, M. (2005). “Testing for Weak Instruments in Linear IV Regression.” In Identification and Inference for Econometric Models.
- Andrews, I., Stock, J. H., & Sun, L. (2019). “Weak Instruments in Instrumental Variables Regression.” Annual Review of Economics, 11, 727-753.