Statsmodels: Statistical Modeling and Econometrics
概述
Statsmodels is Python's premier library for statistical modeling, providing tools for estimation, inference, and diagnostics across a wide range of statistical methods. Apply this skill for rigorous statistical analysis, from simple linear regression to complex time series models and econometric analyses.
使用時機
適用於以下情境:
- Fitting regression models (OLS, WLS, GLS, quantile regression)
- Performing generalized linear modeling (logistic, Poisson, Gamma, etc.)
- Analyzing discrete outcomes (binary, multinomial, count, ordinal)
- Conducting time series analysis (ARIMA, SARIMAX, VAR, forecasting)
- Running statistical tests and diagnostics
- Testing model assumptions (heteroskedasticity, autocorrelation, normality)
- Detecting outliers and influential observations
- Comparing models (AIC/BIC, likelihood ratio tests)
- Estimating causal effects
- Producing publication-ready statistical tables and inference
Quick Start Guide
Linear Regression (OLS)
import statsmodels.api as sm
import numpy as np
import pandas as pd
# Prepare data - ALWAYS add constant for intercept
X = sm.add_constant(X_data)
# Fit OLS model
model = sm.OLS(y, X)
results = model.fit()
# View comprehensive results
print(results.summary())
# Key results
print(f"R-squared: {results.rsquared:.4f}")
print(f"Coefficients:\\n{results.params}")
print(f"P-values:\\n{results.pvalues}")
# Predictions with confidence intervals
predictions = results.get_prediction(X_new)
pred_summary = predictions.summary_frame()
print(pred_summary) # includes mean, CI, prediction intervals
# Diagnostics
from statsmodels.stats.diagnostic import het_breuschpagan
bp_test = het_breuschpagan(results.resid, X)
print(f"Breusch-Pagan p-value: {bp_test[1]:.4f}")
# Visualize residuals
import matplotlib.pyplot as plt
plt.scatter(results.fittedvalues, results.resid)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()
Logistic Regression (Binary Outcomes)
from statsmodels.discrete.discrete_model import Logit
# Add constant
X = sm.add_constant(X_data)
# Fit logit model
model = Logit(y_binary, X)
results = model.fit()
print(results.summary())
# Odds ratios
odds_ratios = np.exp(results.params)
print("Odds ratios:\\n", odds_ratios)
# Predicted probabilities
probs = results.predict(X)
# Binary predictions (0.5 threshold)
predictions = (probs > 0.5).astype(int)
# Model evaluation
from sklearn.metrics import classification_report, roc_auc_score
print(classification_report(y_binary, predictions))
print(f"AUC: {roc_auc_score(y_binary, probs):.4f}")
# Marginal effects
marginal = results.get_margeff()
print(marginal.summary())
Time Series (ARIMA)
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
# Check stationarity
from statsmodels.tsa.stattools import adfuller
adf_result = adfuller(y_series)
print(f"ADF p-value: {adf_result[1]:.4f}")
if adf_result[1] > 0.05:
# Series is non-stationary, difference it
y_diff = y_series.diff().dropna()
# Plot ACF/PACF to identify p, q
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))
plot_acf(y_diff, lags=40, ax=ax1)
plot_pacf(y_diff, lags=40, ax=ax2)
plt.show()
# Fit ARIMA(p,d,q)
model = ARIMA(y_series, order=(1, 1, 1))
results = model.fit()
print(results.summary())
# Forecast
forecast = results.forecast(steps=10)
forecast_obj = results.get_forecast(steps=10)
forecast_df = forecast_obj.summary_frame()
print(forecast_df) # includes mean and confidence intervals
# Residual diagnostics
results.plot_diagnostics(figsize=(12, 8))
plt.show()
Generalized Linear Models (GLM)
import statsmodels.api as sm
# Poisson regression for count data
X = sm.add_constant(X_data)
model = sm.GLM(y_counts, X, family=sm.families.Poisson())
results = model.fit()
print(results.summary())
# Rate ratios (for Poisson with log link)
rate_ratios = np.exp(results.params)
print("Rate ratios:\\n", rate_ratios)
# Check overdispersion
overdispersion = results.pearson_chi2 / results.df_resid
print(f"Overdispersion: {overdispersion:.2f}")
if overdispersion > 1.5:
# Use Negative Binomial instead
from statsmodels.discrete.count_model import NegativeBinomial
nb_model = NegativeBinomial(y_counts, X)
nb_results = nb_model.fit()
print(nb_results.summary())
Core Statistical Modeling Capabilities
1. Linear Regression Models
Comprehensive suite of linear models for continuous outcomes with various error structures.
Available models:
- OLS: Standard linear regression with i.i.d. errors
- WLS: Weighted least squares for heteroskedastic errors
- GLS: Generalized least squares for arbitrary covariance structure
- GLSAR: GLS with autoregressive errors for time series
- Quantile Regression: Conditional quantiles (robust to outliers)
- Mixed Effects: Hierarchical/multilevel models with random effects
- Recursive/Rolling: Time-varying parameter estimation
Key features:
- Comprehensive diagnostic tests
- Robust standard errors (HC, HAC, cluster-robust)
- Influence statistics (Cook's distance, leverage, DFFITS)
- Hypothesis testing (F-tests, Wald tests)
- Model comparison (AIC, BIC, likelihood ratio tests)
- Prediction with confidence and prediction intervals
When to use: Continuous outcome variable, want inference on coefficients, need diagnostics
Reference: See references/linear_models.md for detailed guidance on model selection, diagnostics, and best practices.
2. Generalized Linear Models (GLM)
Flexible framework extending linear models to non-normal distributions.
Distribution families:
- Binomial: Binary outcomes or proportions (logistic regression)
- Poisson: Count data
- Negative Binomial: Overdispersed counts
- Gamma: Positive continuous, right-skewed data
- Inverse Gaussian: Positive continuous with specific variance structure
- Gaussian: Equivalent to OLS
- Tweedie: Flexible family for semi-continuous data
Link functions:
- Logit, Probit, Log, Identity, Inverse, Sqrt, CLogLog, Power
- Choose based on interpretation needs and model fit
Key features:
- Maximum likelihood estimation via IRLS
- Deviance and Pearson residuals
- Goodness-of-fit statistics
- Pseudo R-squared measures
- Robust standard errors
When to use: Non-normal outcomes, need flexible variance and link specifications
Reference: See references/glm.md for family selection, link functions, interpretation, and diagnostics.
3. Discrete Choice Models
Models for categorical and count outcomes.
Binary models:
- Logit: Logistic regression (odds ratios)
- Probit: Probit regression (normal distribution)
Multinomial models:
- MNLogit: Unordered categories (3+ levels)
- Conditional Logit: Choice models with alternative-specific variables
- Ordered Model: Ordinal outcomes (ordered categories)
Count models:
- Poisson: Standard count model
- Negative Binomial: Overdispersed counts
- Zero-Inflated: Excess zeros (ZIP, ZINB)
- Hurdle Models: Two-stage models for zero-heavy data
Key features:
- Maximum likelihood estimation
- Marginal effects at means or average marginal effects
- Model comparison via AIC/BIC
- Predicted probabilities and classification
- Goodness-of-fit tests
When to use: Binary, categorical, or count outcomes
Reference: See references/discrete_choice.md for model selection, interpretation, and evaluation.
4. Time Series Analysis
Comprehensive time series modeling and forecasting capabilities.
Univariate models:
- AutoReg (AR): Autoregressive models
- ARIMA: Autoregressive integrated moving average
- SARIMAX: Seasonal ARIMA with exogenous variables
- Exponential Smoothing: Simple, Holt, Holt-Winters
- ETS: Innovations state space models
Multivariate models:
- VAR: Vector autoregression
- VARMAX: VAR with MA and exogenous variables
- Dynamic Factor Models: Extract common factors
- VECM: Vector error correction models (cointegration)
Advanced models:
- State Space: Kalman filtering, custom specifications
- Regime Switching: Markov switching models
- ARDL: Autoregressive distributed lag
Key features:
- ACF/PACF analysis for model identification
- Stationarity tests (ADF, KPSS)
- Forecasting with prediction intervals
- Residual diagnostics (Ljung-Box, heteroskedasticity)
- Granger causality testing
- Impulse response functions (IRF)
- Forecast error variance decomposition (FEVD)
When to use: Time-ordered data, forecasting, understanding temporal dynamics
Reference: See references/time_series.md for model selection, diagnostics, and forecasting methods.
5. Statistical Tests and Diagnostics
Extensive testing and diagnostic capabilities for model validation.
Residual diagnostics:
- Autocorrelation tests (Ljung-Box, Durbin-Watson, Breusch-Godfrey)
- Heteroskedasticity tests (Breusch-Pagan, White, ARCH)
- Normality tests (Jarque-Bera, Omnibus, Anderson-Darling, Lilliefors)
- Specification tests (RESET, Harvey-Collier)
Influence and outliers:
- Leverage (hat values)
- Cook's distance
- DFFITS and DFBETAs
- Studentized residuals
- Influence plots
Hypothesis testing:
- t-tests (one-sample, two-sample, paired)
- Proportion tests
- Chi-square tests
- Non-parametric tests (Mann-Whitney, Wilcoxon, Kruskal-Wallis)
- ANOVA (one-way, two-way, repeated measures)
Multiple comparisons:
- Tukey's HSD
- Bonferroni correction
- False Discovery Rate (FDR)
Effect sizes and power:
- Cohen's d, eta-squared
- Power analysis for t-tests, proportions
- Sample size calculations
Robust inference:
- Heteroskedasticity-consistent SEs (HC0-HC3)
- HAC standard errors (Newey-West)
- Cluster-robust standard errors
When to use: Validating assumptions, detecting problems, ensuring robust inference
Reference: See references/stats_diagnostics.md for comprehensive testing and diagnostic procedures.
Formula API (R-style)
Statsmodels supports R-style formulas for intuitive model specification:
import statsmodels.formula.api as smf
# OLS with formula
results = smf.ols('y ~ x1 + x2 + x1:x2', data=df).fit()
# Categorical variables (automatic dummy coding)
results = smf.ols('y ~ x1 + C(category)', data=df).fit()
# Interactions
results = smf.ols('y ~ x1 * x2', data=df).fit() # x1 + x2 + x1:x2
# Polynomial terms
results = smf.ols('y ~ x + I(x**2)', data=df).fit()
# Logit
results = smf.logit('y ~ x1 + x2 + C(group)', data=df).fit()
# Poisson
results = smf.poisson('count ~ x1 + x2', data=df).fit()
# ARIMA (not available via formula, use regular API)
Model Selection and Comparison
Information Criteria
# Compare models using AIC/BIC
models = {
'Model 1': model1_results,
'Model 2': model2_results,
'Model 3': model3_results
}
comparison = pd.DataFrame({
'AIC': {name: res.aic for name, res in models.items()},
'BIC': {name: res.bic for name, res in models.items()},
'Log-Likelihood': {name: res.llf for name, res in models.items()}
})
print(comparison.sort_values('AIC'))
# Lower AIC/BIC indicates better model
Likelihood Ratio Test (Nested Models)
# For nested models (one is subset of the other)
from scipy import stats
lr_stat = 2 * (full_model.llf - reduced_model.llf)
df = full_model.df_model - reduced_model.df_model
p_value = 1 - stats.chi2.cdf(lr_stat, df)
print(f"LR statistic: {lr_stat:.4f}")
print(f"p-value: {p_value:.4f}")
if p_value < 0.05:
print("Full model significantly better")
else:
print("Reduced model preferred (parsimony)")
Cross-Validation
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = []
for train_idx, val_idx in kf.split(X):
X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
# Fit model
model = sm.OLS(y_train, X_train).fit()
# Predict
y_pred = model.predict(X_val)
# Score
rmse = np.sqrt(mean_squared_error(y_val, y_pred))
cv_scores.append(rmse)
print(f"CV RMSE: {np.mean(cv_scores):.4f} ± {np.std(cv_scores):.4f}")
最佳實踐
Data Preparation
- Always add constant: Use
sm.add_constant()unless excluding intercept - Check for missing values: Handle or impute before fitting
- Scale if needed: Improves convergence, interpretation (but not required for tree models)
- Encode categoricals: Use formula API or manual dummy coding
Model Building
- Start simple: Begin with basic model, add complexity as needed
- Check assumptions: Test residuals, heteroskedasticity, autocorrelation
- Use appropriate model: Match model to outcome type (binary→Logit, count→Poisson)
- Consider alternatives: If assumptions violated, use robust methods or different model
Inference
- Report effect sizes: Not just p-values
- Use robust SEs: When heteroskedasticity or clustering present
- Multiple comparisons: Correct when testing many hypotheses
- Confidence intervals: Always report alongside point estimates
Model Evaluation
- Check residuals: Plot residuals vs fitted, Q-Q plot
- Influence diagnostics: Identify and investigate influential observations
- Out-of-sample validation: Test on holdout set or cross-validate
- Compare models: Use AIC/BIC for non-nested, LR test for nested
Reporting
- Comprehensive summary: Use
.summary()for detailed output - Document decisions: Note transformations, excluded observations
- Interpret carefully: Account for link functions (e.g., exp(β) for log link)
- Visualize: Plot predictions, confidence intervals, diagnostics
Common Workflows
Workflow 1: Linear Regression Analysis
- Explore data (plots, descriptives)
- Fit initial OLS model
- Check residual diagnostics
- Test for heteroskedasticity, autocorrelation
- Check for multicollinearity (VIF)
- Identify influential observations
- Refit with robust SEs if needed
- Interpret coefficients and inference
- Validate on holdout or via CV
Workflow 2: Binary Classification
- Fit logistic regression (Logit)
- Check for convergence issues
- Interpret odds ratios
- Calculate marginal effects
- Evaluate classification performance (AUC, confusion matrix)
- Check for influential observations
- Compare with alternative models (Probit)
- Validate predictions on test set
Workflow 3: Count Data Analysis
- Fit Poisson regression
- Check for overdispersion
- If overdispersed, fit Negative Binomial
- Check for excess zeros (consider ZIP/ZINB)
- Interpret rate ratios
- Assess goodness of fit
- Compare models via AIC
- Validate predictions
Workflow 4: Time Series Forecasting
- Plot series, check for trend/seasonality
- Test for stationarity (ADF, KPSS)
- Difference if non-stationary
- Identify p, q from ACF/PACF
- Fit ARIMA or SARIMAX
- Check residual diagnostics (Ljung-Box)
- Generate forecasts with confidence intervals
- Evaluate forecast accuracy on test set
Reference Documentation
This skill includes comprehensive reference files for detailed guidance:
references/linear_models.md
Detailed coverage of linear regression models including:
- OLS, WLS, GLS, GLSAR, Quantile Regression
- Mixed effects models
- Recursive and rolling regression
- Comprehensive diagnostics (heteroskedasticity, autocorrelation, multicollinearity)
- Influence statistics and outlier detection
- Robust standard errors (HC, HAC, cluster)
- Hypothesis testing and model comparison
references/glm.md
Complete guide to generalized linear models:
- All distribution families (Binomial, Poisson, Gamma, etc.)
- Link functions and when to use each
- Model fitting and interpretation
- Pseudo R-squared and goodness of fit
- Diagnostics and residual analysis
- Applications (logistic, Poisson, Gamma regression)
references/discrete_choice.md
Comprehensive guide to discrete outcome models:
- Binary models (Logit, Probit)
- Multinomial models (MNLogit, Conditional Logit)
- Count models (Poisson, Negative Binomial, Zero-Inflated, Hurdle)
- Ordinal models
- Marginal effects and interpretation
- Model diagnostics and comparison
references/time_series.md
In-depth time series analysis guidance:
- Univariate models (AR, ARIMA, SARIMAX, Exponential Smoothing)
- Multivariate models (VAR, VARMAX, Dynamic Factor)
- State space models
- Stationarity testing and diagnostics
- Forecasting methods and evaluation
- Granger causality, IRF, FEVD
references/stats_diagnostics.md
Comprehensive statistical testing and diagnostics:
- Residual diagnostics (autocorrelation, heteroskedasticity, normality)
- Influence and outlier detection
- Hypothesis tests (parametric and non-parametric)
- ANOVA and post-hoc tests
- Multiple comparisons correction
- Robust covariance matrices
- Power analysis and effect sizes
When to reference:
- Need detailed parameter explanations
- Choosing between similar models
- Troubleshooting convergence or diagnostic issues
- Understanding specific test statistics
- Looking for code examples for advanced features
Search patterns:
# Find information about specific models
grep -r "Quantile Regression" references/
# Find diagnostic tests
grep -r "Breusch-Pagan" references/stats_diagnostics.md
# Find time series guidance
grep -r "SARIMAX" references/time_series.md
Common Pitfalls to Avoid
- Forgetting constant term: Always use
sm.add_constant()unless no intercept desired - Ignoring assumptions: Check residuals, heteroskedasticity, autocorrelation
- Wrong model for outcome type: Binary→Logit/Probit, Count→Poisson/NB, not OLS
- Not checking convergence: Look for optimization warnings
- Misinterpreting coefficients: Remember link functions (log, logit, etc.)
- Using Poisson with overdispersion: Check dispersion, use Negative Binomial if needed
- Not using robust SEs: When heteroskedasticity or clustering present
- Overfitting: Too many parameters relative to sample size
- Data leakage: Fitting on test data or using future information
- Not validating predictions: Always check out-of-sample performance
- Comparing non-nested models: Use AIC/BIC, not LR test
- Ignoring influential observations: Check Cook's distance and leverage
- Multiple testing: Correct p-values when testing many hypotheses
- Not differencing time series: Fit ARIMA on non-stationary data
- Confusing prediction vs confidence intervals: Prediction intervals are wider
Getting Help
For detailed documentation and examples:
- Official docs: https://www.statsmodels.org/stable/
- User guide: https://www.statsmodels.org/stable/user-guide.html
- Examples: https://www.statsmodels.org/stable/examples/index.html
- API reference: https://www.statsmodels.org/stable/api.html