Public Health
2026-02-28 · 11 min read

Survival Analysis in Stata: A Guide for Epidemiologists

A practical guide to survival analysis in Stata — from stset to Cox PH to competing risks. Written for public health researchers and epidemiologists.

Sytra Team
Research Engineering Team, Sytra AI

Survival analysis is the backbone of epidemiological research. Whether you’re estimating time to disease onset, time to hospital readmission, or time to death after a clinical intervention, the analytical framework is the same: you have time-to-event data with censoring, and you need an estimator that handles both.

Stata’s survival analysis suite is among the most mature in any statistical software. But it requires a specific workflow, starting with stset, that has few parallels elsewhere in Stata: every later st command reads the time, failure, and entry definitions that stset stores. If you skip or misconfigure this step, every downstream analysis is wrong. And this is exactly where ChatGPT falls apart.

Step 1: stset — Declaring Survival Data

Before any survival analysis, you must tell Stata your data is survival data. The stset command is not optional — it defines the time variable, the failure event, and any entry time or censoring structure.

* Basic stset: time variable + failure indicator
stset followup_time, failure(died)
 
* With late entry (left truncation); enter() takes the keyword time plus an expression
stset followup_time, failure(died) enter(time entry_time)
 
* With ID variable for multiple records per subject
stset followup_time, failure(died) id(patient_id)

Key decisions at this stage:

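  • What is failure? The failure() option defines what counts as an event. If died = 1 means the event occurred and died = 0 means censored, specify failure(died); any nonzero, nonmissing value then counts as a failure. If the failure variable has multiple values (e.g., 1 = disease, 2 = death), specify which value: failure(event == 1).
  • Scale matters: Is time in days, months, or years? The unit determines how incidence rates, median survival times, and parametric baseline parameters read. Cox hazard ratios for covariates do not change when you rescale time, so a hazard ratio near 1.0001 points to the covariate’s unit (e.g., age measured in days), not the time unit. Rescale the covariate in that case, and use the scale() option of stset if you want time reported in different units.
  • Check your stset: After running stset, always run stsum to see the summary statistics and verify that the number of subjects, failures, and time at risk look correct.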
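These checks can be sketched as follows. The example assumes Stata's shipped dataset drugtr (loaded with webuse, which needs internet access); its variables studytime (months), died, drug, and age are stand-ins for your own and do not come from this article.

```stata
* Load an example dataset (assumption: drugtr from Stata's example data)
webuse drugtr, clear

* Declare the data: studytime is follow-up in months, died is 0/1
stset studytime, failure(died)

* Verify subjects, failures, and time at risk
stdescribe
stsum

* stset creates _t0 (entry), _t (exit), _d (failure), _st (record is used)
list studytime died _t0 _t _d _st in 1/5

* Records excluded from the analysis have _st == 0; ideally there are none
count if _st == 0

* Report time in years instead of months by rescaling at declaration
stset studytime, failure(died) scale(12)
stsum
```

If the count of excluded records is not zero, the stset output lists the reason for each exclusion (for example, missing times or records that end before entry).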

Step 2: Descriptive Survival — Kaplan-Meier

* Kaplan-Meier survival curves
sts graph, by(treatment) ci
 
* Log-rank test for equality of survival functions
sts test treatment
 
* Median survival time
stci, by(treatment)

The Kaplan-Meier curve is the standard visualization for survival data. Always include confidence intervals (ci). Always run the log-rank test. And always report median survival time when the curve falls to 50% (otherwise the median is undefined); it’s more interpretable than the hazard ratio for non-technical audiences.
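A few related descriptive commands are often useful alongside the three above. This is a sketch on Stata's example dataset drugtr (an assumption; drug is its treatment indicator and time is in months), not on the data discussed in this article.

```stata
webuse drugtr, clear
stset studytime, failure(died)

* Kaplan-Meier curves with a number-at-risk table under the plot
sts graph, by(drug) risktable

* Survival estimates at chosen time points (months), by group
sts list, by(drug) at(6 12 24)

* Median survival with confidence interval; missing if the curve never reaches 50%
stci, by(drug)

* 25th percentile of survival time as an alternative summary
stci, by(drug) p(25)

* Log-rank test, then the Wilcoxon (Breslow) test, which weights early failures more
sts test drug
sts test drug, wilcoxon
```

The log-rank test has the most power when hazards are proportional; if the curves cross, the Wilcoxon variant can give a noticeably different answer.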

Step 3: Cox Proportional Hazards

* Cox PH model
stcox treatment age i.sex i.comorbidity, vce(robust)
 
* Report hazard ratios (default — but be explicit)
stcox treatment age i.sex i.comorbidity, vce(robust) hr

The Cox model estimates hazard ratios: HR = 1.5 means the treatment group has a 50% higher instantaneous rate of the event at any given time, conditional on covariates. An HR < 1 means a lower hazard in the treatment group, which is protective when the event is adverse (death, readmission).
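For a continuous covariate, the hazard ratio is per one unit, which can look deceptively close to 1. A minimal sketch of re-expressing it per 10 units, assuming the example dataset drugtr (age in years; not this article's data):

```stata
webuse drugtr, clear
stset studytime, failure(died)

* Fit the model; output shows hazard ratios by default
stcox drug age

* Replay the same model on the log-hazard (coefficient) scale
stcox, nohr

* Hazard ratio for a 10-year difference in age: exp(10 * b[age])
lincom 10*age, hr

* The same number computed by hand from the stored coefficient
display exp(10 * _b[age])
```

Using lincom rather than hand calculation also gives a confidence interval on the rescaled hazard ratio.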

Critical: the proportional hazards assumption.

The Cox model assumes that hazard ratios are constant over time. If the effect of treatment changes as time passes (e.g., a drug works well initially but wears off), the PH assumption is violated and the hazard ratio is misleading.

* Test the proportional hazards assumption
estat phtest, detail
 
* Visual check: Schoenfeld residuals
estat phtest, plot(treatment)

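A small p-value on estat phtest is evidence that the PH assumption is violated for that variable. In large samples trivial departures can also reach significance, so read the test together with the plot. Solutions: (1) stratify on the offending variable (stcox ..., strata(variable)), (2) include an interaction with time (stcox ..., tvc(variable)), or (3) use a model that does not assume proportional hazards, such as an accelerated failure time model fit with streg.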
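The three remedies look roughly like this in code. The sketch uses Stata's example dataset drugtr (an assumption) purely to show syntax; nothing here implies that drug actually violates PH in those data.

```stata
webuse drugtr, clear
stset studytime, failure(died)

* Baseline model and PH test
stcox drug age
estat phtest, detail

* Option 1: stratify; each stratum gets its own baseline hazard,
* so no hazard ratio is reported for drug
stcox age, strata(drug)

* Option 2: let the drug effect vary with log time via tvc() and texp()
stcox drug age, tvc(drug) texp(ln(_t))

* Option 3: accelerated failure time model (log-normal), shown as time ratios
streg drug age, distribution(lognormal) tratio
```

Stratification is the least intrusive fix when the offending variable is a nuisance factor; when it is the exposure of interest, the tvc() interaction keeps an estimate of its effect at the cost of a more complex interpretation.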

ChatGPT typically skips this test unless you ask for it. It generates stcox and stops. But a Cox model without a PH test is like a regression without checking residuals: you’re publishing results from a model whose key assumption may be violated.

Step 4: Competing Risks

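In many studies, there are multiple ways to “fail.” A cancer patient might die from cancer, die from cardiovascular disease, or die from other causes. If you treat non-cancer deaths as censoring and read cancer mortality off the Kaplan-Meier curve (1 − KM), you overestimate the cumulative incidence of cancer death, because you’re assuming that patients who die from other causes remain at risk and could still have died from cancer given enough time. The cause-specific hazard from such a model is still a valid quantity, but it does not translate directly into cumulative incidence.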

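* Declare the event of interest (cause == 1) as the failure
stset time, failure(cause == 1) id(patient_id)
 
* Fine-Gray subdistribution hazard model; list every competing cause in compete(),
* e.g., compete(cause == 2 3) if there are two
stcrreg treatment age i.sex, compete(cause == 2)
 
* Cumulative incidence function by treatment group
stcurve, cif at1(treatment=0) at2(treatment=1)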

The Fine-Gray model estimates subdistribution hazard ratios (SHR), which account for the competing risk. An SHR > 1 means the treatment group has a higher cumulative incidence of the event of interest, accounting for the fact that some patients experience the competing event instead.
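To see how the cause-specific and subdistribution approaches differ in practice, it can help to fit both on the same data. The sketch below assumes the hypoxia example dataset that ships with Stata (loaded with webuse); the variable names dftime, failtype, ifp, tumsize, and pelnode belong to that dataset, not to this article.

```stata
* Example data from Stata's competing-risks documentation (assumption)
webuse hypoxia, clear

* failtype: 0 = censored, 1 = local relapse (event of interest), 2 = distant relapse
stset dftime, failure(failtype == 1)

* Cause-specific hazard: competing events are treated as censored
stcox ifp tumsize pelnode

* Subdistribution hazard (Fine-Gray): subjects with a competing event
* remain in the risk set
stcrreg ifp tumsize pelnode, compete(failtype == 2)

* Cumulative incidence of local relapse at two values of ifp
stcurve, cif at1(ifp=5 pelnode=0) at2(ifp=20 pelnode=0)
```

The two models answer different questions: the Cox hazard ratios describe the event rate among those still event-free, while the SHRs describe effects on cumulative incidence, so the estimates need not agree.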

Step 5: Parametric Models

When you have a theoretical reason to believe the hazard follows a specific functional form, parametric models can be more efficient than Cox:

* Weibull model (common in engineering and epidemiology)
streg treatment age i.sex, distribution(weibull)
estimates store weibull
 
* Exponential model (constant hazard)
streg treatment age i.sex, distribution(exponential)
estimates store expo
 
* Compare models with AIC/BIC (lower is better); both must be stored first
estimates stats weibull expo
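Two further points are worth a sketch: the exponential model is nested in the Weibull, so a likelihood-ratio test applies, and a Weibull model can be reported either as hazard ratios or as time ratios. The example assumes Stata's example dataset drugtr, with drug and age standing in for the covariates above.

```stata
webuse drugtr, clear
stset studytime, failure(died)

* Exponential model: Weibull with shape parameter p fixed at 1
streg drug age, distribution(exponential)
estimates store expo

* Weibull in the proportional hazards metric (hazard ratios)
streg drug age, distribution(weibull)
estimates store weib_ph

* Likelihood-ratio test of exponential against Weibull
lrtest weib_ph expo

* Fitted hazard by treatment group from the Weibull model
stcurve, hazard at1(drug=0) at2(drug=1)

* Same Weibull likelihood in the accelerated failure time metric (time ratios)
streg drug age, distribution(weibull) time tratio
```

The PH and AFT fits of the Weibull have the same log likelihood; only the parameterization and therefore the interpretation of the coefficients differ.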

How Sytra Handles Survival Analysis

Sytra understands the full survival analysis pipeline. When you say “run a Cox regression of treatment on time to readmission, adjusting for age, sex, and comorbidity,” it generates:

  1. stset with the correct time and failure variables
  2. stsum and stci for descriptive statistics
  3. sts graph for Kaplan-Meier curves
  4. stcox with vce(robust)
  5. estat phtest to check the PH assumption
  6. If PH is violated, it suggests stratification or a parametric alternative
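
As an illustration only, the six steps correspond roughly to a do-file like the one below. This is a hypothetical sketch on Stata's example dataset drugtr, not actual Sytra output; the 0.05 threshold and the choice of stratification as the fallback are assumptions made for the example.

```stata
* Hypothetical do-file following the six steps above
webuse drugtr, clear

* 1. Declare survival data
stset studytime, failure(died)

* 2. Descriptive statistics
stsum
stci, by(drug)

* 3. Kaplan-Meier curves
sts graph, by(drug) ci

* 4. Cox model with robust standard errors
stcox drug age, vce(robust)

* 5. Global test of the PH assumption; keep the p-value
estat phtest
local p_ph = r(p)

* 6. If PH looks doubtful, refit with stratification as one possible fallback
if `p_ph' < 0.05 {
    display "PH test p = " %5.3f `p_ph' "; refitting with strata(drug)"
    stcox age, strata(drug) vce(robust)
}
else {
    display "PH test p = " %5.3f `p_ph' "; no evidence against PH"
}
```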

If your data has competing risks, Sytra detects multiple failure types and suggests the Fine-Gray model, because it understands the epidemiological methodology, not just the Stata syntax.

#Survival Analysis #Stata #Public Health #Epidemiology
