Stata String Functions: substr, strpos, regexm, and 30 More with Examples
Every string function you need in Stata, from basic substr and trim to regex matching, with copy-paste examples for data cleaning.
Your merge keys look identical in the spreadsheet, but hidden spaces and punctuation are splitting the sample in Stata.
You will build a reliable string-cleaning pipeline that turns messy text fields into merge-safe identifiers.
All examples tested in Stata 18 SE. Compatible with Stata 15+.
Quick Answer
- Normalize case with `lower()` or `upper()` first.
- Trim leading and trailing whitespace before pattern operations.
- Use `subinstr()` for fixed replacements and `regexm()` for complex patterns.
- Create cleaned keys in new variables and keep originals for audit.
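The checklist above can be sketched in a few lines. This is a minimal illustration, not a full pipeline: the `changed` flag and the `isid` check are assumptions added here to show one way to audit a cleaned key, and the two input values are made up.

```stata
clear
input str12 raw_firm
" AB-001 "
"ab-002"
end

* Build the cleaned key in a new variable; raw_firm stays untouched for audit
gen firm_clean = lower(trim(raw_firm))

* Hypothetical audit flag: 1 where cleaning changed the raw value
gen byte changed = firm_clean != raw_firm
list raw_firm firm_clean changed

* Stops with an error if the cleaned key does not uniquely identify rows
isid firm_clean
```

Keeping the raw variable next to the cleaned one makes it easy to inspect exactly which rows the cleaning step altered.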
Convert Text Noise into Stable Analysis Variables
Core string cleaning for IDs and names
String cleaning should be deterministic. The same raw input should always map to the same cleaned identifier across runs and collaborators.
A safe pattern is to normalize case, trim spaces, remove punctuation, and only then parse tokens.
If you are extending this pipeline, also review How to Merge Datasets in Stata and Stata Weights Explained.
```stata
clear all
input str12 raw_firm str20 raw_city
" AB-001 " "San Diego, CA"
"ab-002" "Los Angeles, CA"
"AB 003" "Sacramento CA"
"ab/004" "Fresno,CA"
end

* Normalize case and trim, then strip separators from the firm ID
gen firm_clean = lower(trim(raw_firm))
replace firm_clean = subinstr(firm_clean, "-", "", .)
replace firm_clean = subinstr(firm_clean, " ", "", .)
replace firm_clean = subinstr(firm_clean, "/", "", .)

* Replace commas with a space so "Fresno,CA" keeps its word break,
* then collapse repeated internal spaces
gen city_clean = lower(trim(raw_city))
replace city_clean = subinstr(city_clean, ",", " ", .)
replace city_clean = itrim(city_clean)

list raw_firm firm_clean raw_city city_clean
```

```
     +----------------------------------------------------------+
     | raw_firm   firm_clean          raw_city       city_clean |
     |----------------------------------------------------------|
  1. |  AB-001         ab001     San Diego, CA     san diego ca |
  2. |   ab-002        ab002   Los Angeles, CA   los angeles ca |
  3. |   AB 003        ab003     Sacramento CA    sacramento ca |
  4. |   ab/004        ab004         Fresno,CA        fresno ca |
     +----------------------------------------------------------+
```

Pattern extraction with regexm and regexs
Regex functions are useful for extracting structured tokens embedded in free text such as invoice codes, county IDs, or survey prefixes.
Use anchored patterns when possible to avoid accidental partial matches.
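A quick way to see what anchoring buys you is to test the same pattern with and without `^` and `$`. The strings below are hypothetical values chosen for illustration only.

```stata
* Unanchored: matches because "ab001" appears somewhere inside the string
display regexm("xab001", "ab[0-9]+")      // 1

* Anchored: the whole string must be "ab" followed by digits
display regexm("xab001", "^ab[0-9]+$")    // 0
display regexm("ab001",  "^ab[0-9]+$")    // 1
```

The unanchored pattern would silently accept a malformed ID, which is the kind of partial match that anchors are meant to prevent.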
```stata
clear all
input str12 raw_firm str20 raw_city
" AB-001 " "San Diego, CA"
"ab-002" "Los Angeles, CA"
"AB 003" "Sacramento CA"
"ab/004" "Fresno,CA"
end

gen firm_clean = lower(trim(raw_firm))
replace firm_clean = subinstr(firm_clean, "-", "", .)
replace firm_clean = subinstr(firm_clean, " ", "", .)
replace firm_clean = subinstr(firm_clean, "/", "", .)

gen city_clean = lower(trim(raw_city))
replace city_clean = subinstr(city_clean, ",", " ", .)
replace city_clean = itrim(city_clean)

list raw_firm firm_clean raw_city city_clean

* ---- Section-specific continuation ----
gen note = "firm=ab001 year=2020 state=CA"
replace note = "firm=ab002 year=2021 state=NV" in 2
replace note = "firm=ab003 year=2022 state=AZ" in 3
replace note = "firm=ab004 year=2023 state=CA" in 4

* Character classes are repeated explicitly: counted quantifiers such as {4}
* may not be accepted by regexm() in older releases (ustrregexm() accepts them)
gen has_year = regexm(note, "year=[0-9][0-9][0-9][0-9]")
gen extracted_year = real(regexs(1)) if regexm(note, "year=([0-9][0-9][0-9][0-9])")
gen extracted_state = regexs(1) if regexm(note, "state=([A-Z][A-Z])")

list note has_year extracted_year extracted_state
```

```
     +-----------------------------------------------------------------------------+
     |                          note   has_year   extracted_year   extracted_state |
     |-----------------------------------------------------------------------------|
  1. | firm=ab001 year=2020 state=CA          1             2020                CA |
  2. | firm=ab002 year=2021 state=NV          1             2021                NV |
  3. | firm=ab003 year=2022 state=AZ          1             2022                AZ |
  4. | firm=ab004 year=2023 state=CA          1             2023                CA |
     +-----------------------------------------------------------------------------+
```

Common Errors and Fixes
"type mismatch"
String functions were applied to numeric variables without conversion.
Use `describe` to confirm types and convert numerics to strings with `tostring` when needed.
Command that triggers the error:

```stata
gen first_digit = substr(firm_id,1,1)
```

```
type mismatch
r(109);
```

Fix:

```stata
tostring firm_id, gen(firm_id_str)
gen first_digit = substr(firm_id_str,1,1)
```

Full check-and-fix sequence:

```stata
describe firm_id
tostring firm_id, gen(firm_id_str)
gen first_digit = substr(firm_id_str,1,1)
list firm_id firm_id_str first_digit in 1/5
```

```
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
firm_id         float   %9.0g
```
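If the same do-file has to run on files where `firm_id` is sometimes numeric and sometimes already a string, the type check can be scripted instead of read off `describe`. The sketch below is one possible guard, assuming a variable named `firm_id` as in the example above; the two input values are made up.

```stata
clear
input float firm_id
1001
2002
end

* confirm returns a nonzero _rc when firm_id is not a string variable
capture confirm string variable firm_id
if _rc {
    * Numeric: convert into a new string variable
    tostring firm_id, generate(firm_id_str)
}
else {
    * Already a string: copy it so later code can use one name
    clonevar firm_id_str = firm_id
}

gen first_digit = substr(firm_id_str, 1, 1)
list firm_id firm_id_str first_digit
```

Either branch leaves a string variable called `firm_id_str`, so the `substr()` call no longer depends on how the file was imported.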
Command Reference
string functions
Transforms and parses text variables for standardized analysis-ready fields.
- `substr(s,n,l)`: Extracts substring from position n of length l
- `strpos(s,t)`: Finds location of token t in string s
- `regexm(s,pat)`: Pattern match indicator
- `subinstr(s,a,b,.)`: Global replacement of token a with b
How Sytra Handles This
Sytra can convert plain-language parsing requests into tested string pipelines with type checks before function calls.
A direct natural-language prompt for this exact workflow:
"Clean raw firm and city strings by removing punctuation and spaces, normalize case, extract year and state tokens with regex, and produce merge-safe keys."
Sytra catches these errors before you run.
FAQ
Which Stata string functions are used most in cleaning workflows?
substr, strpos, subinstr, trim, lower/upper, and regexm are the core functions for parsing IDs, names, and free-text fields.
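As a small illustration of how `strpos()` and `substr()` combine, the sketch below splits an ID at its separator. The value `AB-001` is used only as an example.

```stata
* Hypothetical ID used only for illustration
local s "AB-001"

* Position of the separator
display strpos("`s'", "-")                          // 3

* Everything before the separator
display substr("`s'", 1, strpos("`s'", "-") - 1)    // AB

* Everything after the separator (a missing length means "to the end")
display substr("`s'", strpos("`s'", "-") + 1, .)    // 001
```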
When should I use regexm instead of strpos?
Use strpos for simple fixed tokens and regexm when pattern logic requires optional segments, anchors, or character classes.
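The difference shows up on a few made-up codes: `strpos()` only tells you whether a fixed token occurs anywhere, while `regexm()` can require a whole layout. The variable names and values below are hypothetical.

```stata
clear
input str20 code
"INV-2021-CA"
"inv2021"
"2021-INV-CA"
end

* Fixed token: true whenever "INV" occurs anywhere (case-sensitive)
gen has_inv = strpos(code, "INV") > 0

* Pattern: INV, digits, then a two-letter suffix, and nothing else
gen inv_layout = regexm(code, "^INV-[0-9]+-[A-Z][A-Z]$")

list code has_inv inv_layout
```

Here the first and third codes contain the token, but only the first one follows the expected layout.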
How can I avoid type mismatch errors with string functions?
Always confirm variable type with describe and convert numeric variables to string using tostring before applying string-only functions.
Related Guides
- Stata Dates: Formatting, Converting, and Working with Date Variables
- Importing Data into Stata: Excel, CSV, Fixed-Width, SAS, and SPSS
- Stata Labels: Variable Labels, Value Labels, and label define
- Finding and Removing Duplicates in Stata: duplicates tag, report, drop
We build practical, reproducible workflows for Stata and R teams working on real empirical research pipelines.