Logistic Regression Power Calculation - Aaron Graves, PhDude Replica

Estimating the required sample size for a logistic regression model is a critical step in clinical research and data science. Without a proper power calculation, you risk running an "underpowered" study: one that is likely to miss a real effect and return a non-significant result even though the effect exists.

Sample Size Calculator

Based on Hsieh et al. (1998) method for a single continuous covariate.

Baseline Probability (P0): The probability of the event when the predictor is at its mean.

Odds Ratio (OR): The effect size you wish to detect.

Desired Power (1 - β)

Significance Level (α)

R² with Other Covariates: Enter 0 for simple logistic regression. Increase it if other variables in the model explain part of the predictor's variance.


Note: This calculation assumes a normally distributed predictor.

Understanding Power in Logistic Regression

In statistical modeling, power is the probability of correctly rejecting a null hypothesis when the alternative hypothesis is true. For logistic regression, this is the probability that your study finds a statistically significant relationship between the independent variable and a binary outcome, given that a relationship of the assumed size really exists.
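To make the definition concrete, here is a minimal sketch that inverts the Hsieh et al. (1998) approximation cited by the calculator to estimate the power reached at a given N. The function name `approximate_power`, the two-sided test, and the per-standard-deviation odds ratio are assumptions made for illustration; the calculator's own implementation is not shown on this page.

```python
import math
from statistics import NormalDist


def approximate_power(n, p0, odds_ratio, alpha=0.05):
    """Approximate power for one standardized, normally distributed predictor.

    Assumes a two-sided test and an odds ratio expressed per one standard
    deviation of the predictor (both are assumptions of this sketch).
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    beta_star = abs(math.log(odds_ratio))
    # Effect size scaled by the information carried by n observations.
    signal = beta_star * math.sqrt(n * p0 * (1 - p0))
    return std_normal.cdf(signal - z_alpha)


# Power rises with sample size when everything else is held fixed.
for n in (50, 100, 200, 400):
    print(n, round(approximate_power(n, p0=0.5, odds_ratio=1.5), 3))
```

Reading the printed values from top to bottom shows how quickly a small study loses the ability to detect the same effect.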

Key Parameters Explained

Baseline Probability (P0): This is the probability of the "event" at the mean value of the predictor; the overall event rate in your population is a common working estimate. If you are studying a rare disease that occurs in 5% of people, your P0 is 0.05. The required sample size grows as P0 moves away from 0.5 in either direction, so rare events generally require much larger samples.
Odds Ratio (OR): This represents the strength of the association. An OR of 1.5 means that for every unit increase in the predictor, the odds of the outcome increase by 50%. In the Hsieh et al. (1998) formula the unit is one standard deviation of the predictor, so express the OR on that scale. Smaller ORs (closer to 1.0) require more data to detect.
R-Squared (Covariates): In multiple logistic regression, your primary variable of interest may be correlated with other variables in the model. The higher the R² from regressing that variable on the other covariates, the more the "effective" sample size is reduced, requiring a larger total N to compensate.
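The effect of the first two parameters can be illustrated without choosing alpha or power at all. The sketch below assumes the Hsieh et al. (1998) proportionality, N ∝ 1 / (P0 × (1 - P0) × ln(OR)²), and reports each scenario relative to a reference case; `relative_sample_size` and the reference values are hypothetical choices made for this example.

```python
import math


def relative_sample_size(p0, odds_ratio, ref_p0=0.5, ref_odds_ratio=1.5):
    """Ratio of the required N to the N needed in a reference scenario.

    Uses only the proportionality N ~ 1 / (p0 * (1 - p0) * ln(OR)^2), so
    alpha and power cancel out of the ratio.
    """
    def information(p, o):
        return p * (1 - p) * math.log(o) ** 2

    return information(ref_p0, ref_odds_ratio) / information(p0, odds_ratio)


# A rarer event or a weaker effect both push the ratio above 1.
for p0 in (0.5, 0.2, 0.05):
    for odds_ratio in (2.0, 1.5, 1.2):
        ratio = relative_sample_size(p0, odds_ratio)
        print(f"P0={p0:<5} OR={odds_ratio:<4} relative N = {ratio:.2f}")
```

Under this approximation, moving from P0 = 0.5 to P0 = 0.05 multiplies the requirement by roughly 5.3, and weakening the OR from 1.5 to 1.2 multiplies it by roughly 4.9.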

How the Calculation Works

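This calculator uses the standard approximation formula for a continuous covariate with a normal distribution. The formula combines the Z-scores for your chosen alpha and power, the log of the Odds Ratio, and the baseline probability. In Hsieh et al. (1998) it takes the form N = (Z(1-α/2) + Z(power))² / (P0 × (1 - P0) × ln(OR)²), where Z(q) is the standard normal quantile at q and the OR is expressed per one standard deviation of the predictor.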
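The approximation can be sketched in a few lines of Python. This is an illustration of the Hsieh et al. (1998) formula for a single normally distributed covariate, not the calculator's actual source code; the name `hsieh_sample_size`, the two-sided test, and rounding up to a whole participant are assumptions.

```python
import math
from statistics import NormalDist


def hsieh_sample_size(p0, odds_ratio, power=0.80, alpha=0.05):
    """Approximate N for simple logistic regression, one normal covariate.

    N = (z_{1-alpha/2} + z_{power})^2 / (p0 * (1 - p0) * ln(OR)^2)
    The odds ratio is taken per one standard deviation of the predictor,
    and the test is assumed to be two-sided.
    """
    if not 0 < p0 < 1:
        raise ValueError("p0 must be strictly between 0 and 1")
    if odds_ratio <= 0 or odds_ratio == 1:
        raise ValueError("odds_ratio must be positive and different from 1")
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(power)
    beta_star = math.log(odds_ratio)
    n = (z_alpha + z_power) ** 2 / (p0 * (1 - p0) * beta_star ** 2)
    return math.ceil(n)  # round up: a fraction of a participant is not possible


print(hsieh_sample_size(p0=0.5, odds_ratio=1.5))   # balanced outcome
print(hsieh_sample_size(p0=0.05, odds_ratio=1.5))  # rare outcome
```

With P0 = 0.5, OR = 1.5, 80% power, and α = 0.05, the sketch gives about 191 participants; the rare-outcome call shows how sharply the requirement rises when P0 drops to 0.05.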

The adjustment for multiple regression is handled by the "Variance Inflation Factor" method: the sample size required for a simple regression is divided by (1 - R²), where R² is the coefficient of determination obtained when the predictor of interest is regressed on all other covariates. For example, with R² = 0.25, a simple-regression requirement of 300 becomes 300 / 0.75 = 400.
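As a small sketch of that adjustment, the helper below divides a simple-regression sample size by (1 - R²) and rounds up. The name `adjust_for_covariates` is hypothetical, and the input is assumed to come from a separate simple-regression calculation.

```python
import math


def adjust_for_covariates(n_simple, r_squared):
    """Inflate a simple-regression sample size by 1 / (1 - R^2).

    r_squared is the R^2 from regressing the predictor of interest on the
    other covariates; 0 leaves the sample size unchanged.
    """
    if not 0 <= r_squared < 1:
        raise ValueError("r_squared must be in [0, 1)")
    return math.ceil(n_simple / (1 - r_squared))


for r_squared in (0.0, 0.25, 0.5):
    print(r_squared, adjust_for_covariates(191, r_squared))
```

Passing the unrounded simple-regression value, when it is available, avoids rounding up twice.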

Practical Tips for Researchers

When performing a logistic regression power calculation, it is always wise to:

Be conservative: Use a slightly smaller Odds Ratio than you expect to ensure your study isn't underpowered if the effect is weaker than anticipated.
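Account for Attrition: If you calculate a need for 500 participants, but expect a 10% drop-out rate, you should aim to recruit at least 556 participants (500 / 0.90 ≈ 555.6, rounded up).
Check Distributions: If your predictor is not normally distributed (e.g., it's binary or highly skewed), the normal-covariate approximation used here does not apply directly and the required sample size may differ. Hsieh et al. (1998) give a separate formula for a binary covariate.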
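The attrition tip above amounts to dividing the required sample size by the expected retention rate. A minimal sketch, with `inflate_for_attrition` as a hypothetical helper name:

```python
import math


def inflate_for_attrition(n_required, dropout_rate):
    """Number to recruit so that n_required remain after expected drop-out."""
    if not 0 <= dropout_rate < 1:
        raise ValueError("dropout_rate must be in [0, 1)")
    return math.ceil(n_required / (1 - dropout_rate))


print(inflate_for_attrition(500, 0.10))  # 556, matching the example above
```

Note that multiplying by 1.10 instead would give 550, which falls short of 500 completers after a 10% loss.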