Data Management
2026-02-1614 min read

Stata String Functions: substr, strpos, regexm, and 30 More with Examples

Every string function you need in Stata โ€” from basic substr and trim to regex matching โ€” with copy-paste examples for data cleaning.

Sytra Team
Research Engineering Team, Sytra AI

Your merge keys look identical in the spreadsheet, but hidden spaces and punctuation are splitting the sample in Stata.

You will build a reliable string-cleaning pipeline that turns messy text fields into merge-safe identifiers.

All examples tested in Stata 18 SE. Compatible with Stata 15+.


Quick Answer

  1. Normalize case with `lower()` or `upper()` first.
  2. Trim leading and trailing whitespace before pattern operations.
  3. Use `subinstr()` for fixed replacements and `regexm()` for complex patterns.
  4. Create cleaned keys in new variables and keep originals for audit.

Convert Text Noise into Stable Analysis Variables

Core string cleaning for IDs and names

String cleaning should be deterministic. The same raw input should always map to the same cleaned identifier across runs and collaborators.

A safe pattern is normalize case, trim spaces, remove punctuation, and then parse tokens.

If you are extending this pipeline, also review How to Merge Datasets in Stata and Stata Weights Explained.

string-core-cleaning.do
stata
1clear all
2input str12 raw_firm str20 raw_city
3" AB-001 " "San Diego, CA"
4"ab-002" "Los Angeles, CA"
5"AB 003" "Sacramento CA"
6"ab/004" "Fresno,CA"
7end
8
9gen firm_clean = lower(trim(raw_firm))
10replace firm_clean = subinstr(firm_clean, "-", "", .)
11replace firm_clean = subinstr(firm_clean, " ", "", .)
12replace firm_clean = subinstr(firm_clean, "/", "", .)
13
14gen city_clean = lower(trim(raw_city))
15replace city_clean = subinstr(city_clean, ",", "", .)
16replace city_clean = itrim(city_clean)
17
18list raw_firm firm_clean raw_city city_clean
. list raw_firm firm_clean raw_city city_clean
     +---------------------------------------------------------+
     | raw_firm   firm_clean          raw_city         city_clean |
     |---------------------------------------------------------|
  1. |  AB-001        ab001      San Diego, CA      san diego ca |
  2. |   ab-002       ab002    Los Angeles, CA    los angeles ca |
  3. |   AB 003       ab003      Sacramento CA      sacramento ca |
  4. |   ab/004       ab004          Fresno,CA          fresno ca |
     +---------------------------------------------------------+
๐Ÿ’กKeep raw columns
Retain raw string columns until final QC is signed off. This makes discrepancy tracing possible during replication review.

Pattern extraction with regexm and regexs

Regex functions are useful for extracting structured tokens embedded in free text such as invoice codes, county IDs, or survey prefixes.

Use anchored patterns when possible to avoid accidental partial matches.

string-regex.do
stata
1clear all
2input str12 raw_firm str20 raw_city
3" AB-001 " "San Diego, CA"
4"ab-002" "Los Angeles, CA"
5"AB 003" "Sacramento CA"
6"ab/004" "Fresno,CA"
7end
8
9gen firm_clean = lower(trim(raw_firm))
10replace firm_clean = subinstr(firm_clean, "-", "", .)
11replace firm_clean = subinstr(firm_clean, " ", "", .)
12replace firm_clean = subinstr(firm_clean, "/", "", .)
13
14gen city_clean = lower(trim(raw_city))
15replace city_clean = subinstr(city_clean, ",", "", .)
16replace city_clean = itrim(city_clean)
17
18list raw_firm firm_clean raw_city city_clean
19
20* ---- Section-specific continuation ----
21gen note = "firm=ab001 year=2020 state=CA"
22replace note = "firm=ab002 year=2021 state=NV" in 2
23replace note = "firm=ab003 year=2022 state=AZ" in 3
24replace note = "firm=ab004 year=2023 state=CA" in 4
25
26gen has_year = regexm(note, "year=[0-9]{4}")
27gen extracted_year = real(regexs(1)) if regexm(note, "year=([0-9]{4})")
28gen extracted_state = regexs(1) if regexm(note, "state=([A-Z]{2})")
29
30list note has_year extracted_year extracted_state
. list note has_year extracted_year extracted_state
     +-----------------------------------------------------------+
     | note                          has_year   extracted_year   extracted_state |
     |-----------------------------------------------------------|
  1. | firm=ab001 year=2020 state=CA        1            2020               CA |
  2. | firm=ab002 year=2021 state=NV        1            2021               NV |
  3. | firm=ab003 year=2022 state=AZ        1            2022               AZ |
  4. | firm=ab004 year=2023 state=CA        1            2023               CA |
     +-----------------------------------------------------------+
๐Ÿ‘regexs depends on prior regexm
Call regexm in the same expression context before regexs. Otherwise regexs may return stale matches from previous lines.

Common Errors and Fixes

"type mismatch"

String functions were applied to numeric variables without conversion.

Use `describe` to confirm types and convert numerics to strings with `tostring` when needed.

. gen first_digit = substr(firm_id,1,1)
type mismatch
r(109);
This causes the error
wrong-way.do
stata
gen first_digit = substr(firm_id,1,1)
This is the fix
right-way.do
stata
tostring firm_id, gen(firm_id_str)
gen first_digit = substr(firm_id_str,1,1)
error-fix.do
stata
1describe firm_id
2tostring firm_id, gen(firm_id_str)
3gen first_digit = substr(firm_id_str,1,1)
4list firm_id firm_id_str first_digit in 1/5
. describe firm_id
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
firm_id         float   %9.0g

Command Reference

string functions

Stata docs โ†’

Transforms and parses text variables for standardized analysis-ready fields.

gen newvar = lower(trim(subinstr(oldvar, "-", "", .)))
substr(s,n,l)Extracts substring from position n of length l
strpos(s,t)Finds location of token t in string s
regexm(s,pat)Pattern match indicator
subinstr(s,a,b,.)Global replacement of token a with b

How Sytra Handles This

Sytra can convert plain-language parsing requests into tested string pipelines with type checks before function calls.

A direct natural-language prompt for this exact workflow:

sytra-prompt.txt
bash
Clean raw firm and city strings by removing punctuation and spaces, normalize case, extract year and state tokens with regex, and produce merge-safe keys.

Sytra catches these errors before you run.

Sytra can convert plain-language parsing requests into tested string pipelines with type checks before function calls.

Join the Waitlist โ†’

FAQ

Which Stata string functions are used most in cleaning workflows?

substr, strpos, subinstr, trim, lower/upper, and regexm are the core functions for parsing IDs, names, and free-text fields.

When should I use regexm instead of strpos?

Use strpos for simple fixed tokens and regexm when pattern logic requires optional segments, anchors, or character classes.

How can I avoid type mismatch errors with string functions?

Always confirm variable type with describe and convert numeric variables to string using tostring before applying string-only functions.


Written by Sytra Team
Research Engineering Team, Sytra AI

We build practical, reproducible workflows for Stata and R teams working on real empirical research pipelines.

#Stata#Strings#Data Cleaning#Reference

Enjoyed this article?