Survival Analysis in Stata: A Guide for Epidemiologists
A practical guide to survival analysis in Stata — from stset to Cox PH to competing risks. Written for public health researchers and epidemiologists.
Survival analysis is the backbone of epidemiological research. Whether you’re estimating time to disease onset, time to hospital readmission, or time to death after a clinical intervention, the analytical framework is the same: you have time-to-event data with censoring, and you need an estimator that handles both.
Stata’s survival analysis suite is among the most mature in any statistical software. But it requires a specific workflow, starting with stset, that differs from most other Stata commands: you declare the survival structure once, and every later st command relies on that declaration. If you misconfigure this step, every downstream analysis is wrong. And this is exactly where ChatGPT falls apart.
Step 1: stset — Declaring Survival Data
Before any survival analysis, you must tell Stata your data is survival data. The stset command is not optional — it defines the time variable, the failure event, and any entry time or censoring structure.
Key decisions at this stage:
- What is failure? The failure() option defines what counts as an event. If died = 1 means the event occurred and died = 0 means censored, specify failure(died). If the failure variable has multiple values (e.g., 1 = disease, 2 = death), specify which value: failure(event == 1).
- Scale matters. Is time in days, months, or years? Cox hazard ratios depend only on the ordering of event times, so they do not change with the time unit, but incidence rates, median survival times, and the baseline parameters of parametric models are all reported in that unit. If time is recorded in days, consider rescaling to months or years with the scale() option so these quantities are easier to read.
- Check your stset. After running stset, always run stsum to see the summary statistics and verify that the number of subjects, failures, and time at risk look correct.
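The declaration and check described above might look like the following minimal sketch. It assumes the cancer example dataset that ships with Stata (variables studytime, died, drug, age); the variable names in the commented-out variants are hypothetical and only illustrate the options.

```stata
* Load an example dataset shipped with Stata (a cancer drug trial)
sysuse cancer, clear

* studytime = months to death or end of follow-up; died = 1 if the patient died
stset studytime, failure(died)

* Verify the declaration: subjects, failures, and time at risk
stsum
stdescribe

* Hypothetical variants (illustrative variable names, not in this dataset):
*   stset followup_days, failure(event == 1) scale(365.25)      // only event code 1 counts; time in years
*   stset exit_age, failure(died) enter(time entry_age) id(patid) // delayed entry, one subject per patid
```

Reading the stset output before moving on is the cheapest check available: the reported number of failures should match a simple tabulation of the event variable.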
Step 2: Descriptive Survival — Kaplan-Meier
The Kaplan-Meier curve is the standard visualization for survival data. Always include confidence intervals (ci). Always run the log-rank test. And always report median survival time — it’s more interpretable than the hazard ratio for non-technical audiences.
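As a rough sketch of these three descriptive steps, again assuming the cancer dataset shipped with Stata, where drug identifies the treatment arm:

```stata
sysuse cancer, clear
stset studytime, failure(died)

* Kaplan-Meier curves by treatment arm, with pointwise confidence bands
sts graph, by(drug) ci

* Log-rank test for equality of survivor functions across arms
sts test drug, logrank

* Median survival time with a confidence interval, by arm
stci, by(drug) median
```

If an arm never drops below 50% survival during follow-up, the median is not estimable for that arm, and stci will report it as missing.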
Step 3: Cox Proportional Hazards
The Cox model estimates hazard ratios: HR = 1.5 means the treatment group has a 50% higher instantaneous rate of the event at any given time, conditional on covariates. An HR < 1 means the treatment is protective.
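A minimal example of fitting the model, assuming the cancer dataset shipped with Stata. The choice of covariates here is purely illustrative.

```stata
sysuse cancer, clear
stset studytime, failure(died)

* drug is coded 1 = placebo, 2 and 3 = active drugs; i.drug uses placebo as the base level
* stcox reports hazard ratios by default; add the nohr option to see log-hazard coefficients
stcox i.drug age
```

Each row for drug is then a hazard ratio relative to placebo, holding age constant.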
Critical: the proportional hazards assumption.
The Cox model assumes that hazard ratios are constant over time. If the effect of treatment changes as time passes (e.g., a drug works well initially but wears off), the PH assumption is violated and the hazard ratio is misleading.
A small p-value from estat phtest is evidence that the PH assumption is violated; add the detail option to see the test for each variable. Solutions: (1) stratify on the offending variable (stcox ..., strata(variable)), (2) include a time interaction (the tvc() option), or (3) use a model that does not rely on proportional hazards, such as an accelerated failure time model.
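The test and the first two remedies could be sketched as follows, assuming the cancer dataset shipped with Stata. Which remedy is appropriate depends on the study, so treat this as an outline and not a recipe.

```stata
sysuse cancer, clear
stset studytime, failure(died)
stcox i.drug age

* Schoenfeld-residual test: one row per covariate plus a global test
estat phtest, detail

* Graphical check: log-log survival curves should be roughly parallel under PH
stphplot, by(drug)

* Remedy 1: stratify on the offending variable (each stratum gets its own baseline hazard)
stcox age, strata(drug)

* Remedy 2: let a covariate's effect change with log time
stcox i.drug age, tvc(age) texp(ln(_t))
```

Stratifying on a variable removes its hazard ratio from the output, so this remedy suits nuisance covariates better than the exposure of interest.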
ChatGPT often skips this test: it generates stcox and stops. But a Cox model without a PH test is like a regression without checking residuals: you’re publishing results from a model whose key assumption may be violated.
Step 4: Competing Risks
In many studies, there are multiple ways to “fail.” A cancer patient might die from cancer, die from cardiovascular disease, or die from other causes. If you treat non-cancer deaths as censoring and read cumulative incidence off one minus the Kaplan-Meier estimate, you overestimate the probability of cancer death, because you’re implicitly assuming that patients who die from other causes could still have died from cancer given enough time.
The Fine-Gray model estimates subdistribution hazard ratios (SHR), which account for the competing risk. An SHR > 1 means the treatment group has a higher cumulative incidence of the event of interest, accounting for the fact that some patients experience the competing event instead.
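A minimal Fine-Gray sketch using stcrreg. It assumes the hypoxia example dataset from Stata's documentation, downloaded with webuse (internet access required), where failtype == 1 is taken as the event of interest and failtype == 2 as the competing event. These data concern relapse, not cause of death, so they only illustrate the syntax.

```stata
* Example dataset from the Stata documentation (requires internet access)
webuse hypoxia, clear

* Declare the event of interest; other failtype values are not failures at this stage
stset dftime, failure(failtype == 1)

* Fine-Gray model: compete() names the competing event; output is subdistribution hazard ratios
stcrreg ifp tumsize pelnode, compete(failtype == 2)

* Cumulative incidence function implied by the fitted model
stcurve, cif
```

Note that the competing event is passed to compete() and is not folded into failure(): stset declares the event of interest, and stcrreg handles the competing one.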
Step 5: Parametric Models
When you have a theoretical reason to believe the hazard follows a specific functional form, parametric models can be more efficient than Cox:
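For instance, a sketch with streg, assuming the cancer dataset shipped with Stata. The three distributions are chosen for illustration and are not a recommendation for any particular study.

```stata
sysuse cancer, clear
stset studytime, failure(died)

* Weibull: monotone baseline hazard, reported as hazard ratios by default
streg i.drug age, distribution(weibull)
estimates store weib

* Exponential: constant baseline hazard
streg i.drug age, distribution(exponential)
estimates store expo

* Log-logistic: an accelerated failure time model (coefficients are on the log-time scale)
streg i.drug age, distribution(loglogistic)
estimates store llog

* Compare the fits by AIC and BIC
estimates stats weib expo llog
```

Because these models are not all nested, AIC and BIC give a more general comparison than a likelihood-ratio test.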
How Sytra Handles Survival Analysis
Sytra understands the full survival analysis pipeline. When you say “run a Cox regression of treatment on time to readmission, adjusting for age, sex, and comorbidity,” it generates:
- stset with the correct time and failure variables
- stsum and stci for descriptive statistics
- sts graph for Kaplan-Meier curves
- stcox with vce(robust)
- estat phtest to check the PH assumption
- If PH is violated, it suggests stratification or a parametric alternative
If your data has competing risks, Sytra detects multiple failure types and suggests the Fine-Gray model, because it understands the epidemiological methodology, not just the Stata syntax.