Python has no direct equivalent of R's MASS::stepAIC for stepwise regression.
Stepwise regression is a greedy algorithm: at each iteration it either adds the single feature that most improves the model or removes one from it, until an optimal subset is reached. There are two variants: forward stepwise regression and backward stepwise regression (a minimal sketch of the forward variant is shown after the brute-force example below).
However, we can take a brute-force approach and compute the AIC of every candidate model.
Here is a code example that selects features by exhaustively searching all feature subsets for the lowest AIC:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from itertools import combinations
# Load Boston Housing dataset
boston = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv')
# Split dataset into features (X) and target (y)
X = boston.drop('medv', axis=1)
y = boston['medv']
# Add constant to features (for intercept)
X = sm.add_constant(X)
# Create list of all possible feature combinations (excluding constant)
feature_combinations = [c for i in range(1, len(X.columns)) for c in combinations(X.columns[1:], i)]
# Initialize best AIC and best feature combination
best_aic = np.inf
best_features = None
# Loop through all feature combinations and calculate AIC
for features in feature_combinations:
    # Fit OLS model with current feature combination
    model = sm.OLS(y, X[list(features) + ['const']])
    results = model.fit()
    # Calculate AIC for current model
    aic = results.aic
    # Update best AIC and best feature combination if current model has lower AIC
    if aic < best_aic:
        best_aic = aic
        best_features = features
# Fit final OLS model with best feature combination
final_model = sm.OLS(y, X[list(best_features) + ['const']])
final_results = final_model.fit()
# Print summary of final model
print(final_results.summary())
This brute-force search fits an OLS model for every possible subset of the features, computes the AIC of each, and keeps the subset with the lowest AIC; the final model is then refit on that best subset and its summary is printed.
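For comparison, the greedy forward procedure described earlier can also be written by hand. Below is a minimal sketch that assumes the X (with its 'const' column) and y defined above; forward_stepwise_aic is a hypothetical helper written for illustration, not a statsmodels function:
import statsmodels.api as sm
def forward_stepwise_aic(X, y):
    # Greedy forward selection: start from the intercept-only model and,
    # at each step, add the single feature that lowers the AIC the most.
    # Stop when no remaining feature improves the AIC.
    remaining = [c for c in X.columns if c != 'const']
    selected = []
    current_aic = sm.OLS(y, X[['const']]).fit().aic
    while remaining:
        candidates = [(sm.OLS(y, X[selected + [f, 'const']]).fit().aic, f) for f in remaining]
        best_new_aic, best_feature = min(candidates)
        if best_new_aic >= current_aic:
            break  # no candidate lowers the AIC any further
        selected.append(best_feature)
        remaining.remove(best_feature)
        current_aic = best_new_aic
    return selected, current_aic
selected, aic = forward_stepwise_aic(X, y)
print(selected, aic)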
Note that such selection procedures can still overfit: the greedy stepwise search picks the locally best feature at each step rather than considering all combinations, and even the exhaustive AIC search chooses the subset on the training data alone. It is therefore advisable to use cross-validation or other methods to evaluate the performance of the selected features.
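As one way to do that, here is a minimal sketch using scikit-learn's cross_val_score to check the subset found above. It assumes the X, y, and best_features variables from the brute-force example; the 'const' column is omitted because LinearRegression fits an intercept itself:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
# 5-fold cross-validated MSE for the selected feature subset
scores = cross_val_score(LinearRegression(), X[list(best_features)], y,
                         cv=5, scoring='neg_mean_squared_error')
print('Mean CV MSE:', -scores.mean())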