2026-02-11 · 7 min read

How to Structure a Stata Project: Directory Layout, Naming, and Automation

A clean Stata project structure saves you hours. Here's the directory layout, naming conventions, and master .do file template used by top economics departments.

Sytra Team
Research Engineering Team, Sytra AI

A clean project structure is the difference between “I can resume this analysis after three months” and “I need to start over.” Most researchers learn this the hard way — usually during revisions, when a referee asks for a robustness check and you can’t find the right .do file.

project/
├── master.do
├── config.do
├── code/
│ ├── 01_import.do
│ ├── 02_clean.do
│ ├── 03_construct.do
│ ├── 04_analysis.do
│ ├── 05_robustness.do
│ └── 06_tables_figures.do
├── data/
│ ├── raw/ ← never modify
│ ├── intermediate/ ← generated by code
│ └── analysis/ ← final analysis datasets
├── output/
│ ├── tables/
│ └── figures/
├── docs/
│ ├── codebook.md
│ └── variable_definitions.xlsx
└── logs/

Naming Conventions

  • Numbered prefixes: 01_, 02_, etc. make the execution order obvious and keep the files sorted in run order.
  • Descriptive names: 03_construct_treatment_vars.do, not analysis_v2_new.do.
  • No spaces in file names. Use underscores.
  • Data files: Include the date or version (for example, panel_data_2024.dta), but only in the raw folder. Intermediate files are regenerated by code, so they keep stable names.

The config.do Pattern

* config.do — Shared settings across all scripts
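version 18    // applies only inside this file; repeat it at the top of each script
clear         // not "clear all", which would also reset master.do's timers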
set more off
set scheme s2color
 
* Path globals
global root "/Users/researcher/projects/my_paper"
global code "$root/code"
global raw "$root/data/raw"
global inter "$root/data/intermediate"
global analysis "$root/data/analysis"
global tables "$root/output/tables"
global figures "$root/output/figures"
global logs "$root/logs"
 
* Analysis parameters
global controls "age i.race education income"
global cluster_var "state"
global sample_restriction "if year >= 2000 & year <= 2020"
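
A hard-coded root works on one machine only. One common way to share config.do with coauthors is to pick the root from the operating-system username, which Stata exposes as c(username). The sketch below is an assumption about how that could look, not part of the original template; both usernames and the second path are hypothetical.

```stata
* Choose the project root per user. Usernames and paths are placeholders.
if "`c(username)'" == "researcher" {
    global root "/Users/researcher/projects/my_paper"
}
else if "`c(username)'" == "coauthor" {
    global root "C:/Users/coauthor/Dropbox/my_paper"
}
else {
    display as error "Unknown user `c(username)': add your project path to config.do"
    exit 198
}
```

Stopping with an error for unknown users is a deliberate choice here: a missing root would otherwise turn every path into one that starts at the filesystem root.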

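Every script in code/ starts with do "$root/config.do". Stata expands $root only after config.do has defined it, so in a fresh session config.do itself has to be run by its own path first. Change a path in one place and it propagates everywhere; change the control variables and every regression that uses $controls updates.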
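
To show how the analysis globals expand inside a regression, here is a small sketch on simulated data. The variables outcome and treatment, the seed, and all simulated values are made up for illustration; only the three globals come from config.do.

```stata
* Toy data so the sketch runs on its own; every variable here is simulated.
clear
set seed 20260211
set obs 1000
generate state     = ceil(50 * runiform())
generate year      = 1995 + floor(30 * runiform())
generate age       = 18 + floor(50 * runiform())
generate race      = ceil(4 * runiform())
generate education = 8 + floor(12 * runiform())
generate income    = exp(rnormal(10, 1))
generate treatment = runiform() < 0.5
generate outcome   = 0.5 * treatment + 0.02 * age + rnormal()

* Same analysis globals as in config.do
global controls "age i.race education income"
global cluster_var "state"
global sample_restriction "if year >= 2000 & year <= 2020"

* Stata expands this line to:
*   regress outcome treatment age i.race education income ///
*       if year >= 2000 & year <= 2020, vce(cluster state)
regress outcome treatment $controls $sample_restriction, vce(cluster $cluster_var)
```

Editing the controls global and rerunning changes the specification without touching the regress line.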

Rules for Raw Data

  1. Never modify raw data files. They are read-only inputs.
  2. Document provenance. Where did each file come from? When was it downloaded? What’s the URL?
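  3. Include checksums. Run datasignature once after downloading each raw file to generate a hash, store it in a separate signature file so the raw file itself stays untouched, and confirm against it on later runs.

* Verify raw data hasn’t changed
* (one-time setup: datasignature set, saving("$root/docs/census_2020.dtasig"))
use "$raw/census_2020.dta", clear
datasignature confirm using "$root/docs/census_2020.dtasig"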
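
The full set-then-confirm cycle can be tried on Stata's bundled auto dataset. This is a sketch under the assumption that the signature lives in a separate file; auto_demo.dtasig is a throwaway name chosen for the example.

```stata
* Load a dataset that ships with Stata
sysuse auto, clear

* Record the signature in a separate file; the .dta on disk is never rewritten
datasignature set, reset saving("auto_demo.dtasig", replace)

* Data unchanged: confirm passes
datasignature confirm using "auto_demo.dtasig"

* Simulate an accidental edit, then confirm again without halting the do-file
replace price = price + 1 in 1
capture noisily datasignature confirm using "auto_demo.dtasig"
if _rc {
    display as error "data no longer match the stored signature"
}

* Clean up the throwaway signature file
erase "auto_demo.dtasig"
```

In a real pipeline you would drop the capture prefix so that a mismatch stops the run.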

The Master .do File Pattern

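* master.do
* Run from the project folder: $root is empty until config.do has run
do "config.do"
 
cap log close
local today = subinstr(trim("`c(current_date)'"), " ", "_", .)
log using "$logs/master_`today'.log", replace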
 
timer clear
timer on 1
 
do "$code/01_import.do"
do "$code/02_clean.do"
do "$code/03_construct.do"
do "$code/04_analysis.do"
do "$code/05_robustness.do"
do "$code/06_tables_figures.do"
 
timer off 1
timer list
log close

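Run do master.do from a fresh Stata session, with the working directory set to the project folder. If it completes without error, the whole pipeline from raw data to tables and figures can be regenerated with one command. If it doesn’t, fix it until it does.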
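
A run that finishes without error can still leave an expected file unwritten, for example when an export line was commented out. One option is to end master.do with a check like the sketch below. The file names are hypothetical placeholders, and $tables and $figures are assumed to be defined by config.do.

```stata
* Hypothetical deliverables; replace with the files your own scripts write.
foreach f in "table1.tex" "table2.tex" {
    confirm file "$tables/`f'"
}
foreach f in "figure1.png" {
    confirm file "$figures/`f'"
}
display as text "All expected outputs found."
```

confirm file raises an error when a file is missing, so master.do stops at the first missing output.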

Version Control

Put your code in Git. Not your data (unless it’s small). Not your output (it’s regenerated). Just the code and the config.

# .gitignore for Stata projects
data/intermediate/
data/analysis/
output/
logs/
*.log
*.smcl