Feature Engineering Made Easy Notes
Chapter 5 Feature Selection
Predictive performance metrics
- True positive rate and false positive rate
- Mean absolute error (regression)
- $R^2 = 1 - \frac{SS_{res}}{SS_{total}}$, the coefficient of determination, where $SS_{res} = \sum(y_i - \widehat{y})^2$ and $SS_{total} = \sum(y_i - \overline{y})^2$. The closer $R^2$ is to 1, the better the model (a worked check appears after this list).
- Training time
- Time needed to predict on new data
- …
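As a quick sanity check of the $R^2$ definition above, the hand-computed value matches sklearn's r2_score (a minimal sketch with made-up numbers, not data from the book):

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # toy targets
y_pred = np.array([2.5, 0.0, 2.0, 8.0])    # toy predictions
ss_res = np.sum((y_true - y_pred) ** 2)             # residual sum of squares
ss_total = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
print(1 - ss_res / ss_total)       # ~0.949
print(r2_score(y_true, y_pred))    # same value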
Statistics-based feature selection
- Pearson correlation $\frac{Cov(x,y)}{\sigma_x \sigma_y}$: measures the linear relationship between two variables (a selection sketch appears after the SelectKBest example below).
- Hypothesis test:
  - ANOVA F-value
  - chi-squared
Use SelectKBest to keep the k best features:
from sklearn.feature_selection import SelectKBest, f_classif
# keep the 5 features with the highest ANOVA F-values
k_best = SelectKBest(f_classif, k=5)
k_best.fit_transform(X, y)
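Correlation-based selection can also be done by hand; a minimal sketch assuming X is a pandas DataFrame, y is array-like, and 0.2 is an arbitrary example cutoff:

import pandas as pd

# Pearson correlation of each column with the target
correlations = X.corrwith(pd.Series(y, index=X.index))
selected_columns = correlations[correlations.abs() > 0.2].index
X_selected = X[selected_columns]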
Model-based feature selection
Use SelectFromModel with a machine-learning model that exposes feature importances or coefficients, e.g.:
- Decision Tree
- Logistic regression for classification, linear regression for regression tasks (a coefficient-based sketch follows this list)
- SVC (for binary classification datasets)
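For linear models, SelectFromModel ranks features by the absolute value of their coefficients rather than feature_importances_; a minimal sketch with LogisticRegression (the l1 penalty and 'mean' threshold are example choices, not the book's settings):

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# coefficient-based selection with a linear model
logit_selector = SelectFromModel(LogisticRegression(penalty='l1', solver='liblinear'),
                                 threshold='mean')
X_logit = logit_selector.fit_transform(X, y)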
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import SelectFromModel
import pandas as pd

# fit a decision tree and rank features by importance
tree = DecisionTreeClassifier()
tree.fit(X, y)
importances = pd.DataFrame({'importance': tree.feature_importances_,
                            'feature': X.columns}).sort_values('importance', ascending=False)
# grid reused below for the pipeline's 'classifier' step
tree_pipe_params = {'classifier__max_depth': [1, 3, 5, 7]}

# keep only features whose importance exceeds the threshold
select_from_model = SelectFromModel(DecisionTreeClassifier(), threshold=.05)
selected_X = select_from_model.fit_transform(X, y)
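To see which columns survived the threshold, get_support() on the fitted selector can be mapped back to the column names (assumes X is a pandas DataFrame):

kept_columns = X.columns[select_from_model.get_support()]
print(kept_columns)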
from copy import deepcopy
from sklearn.pipeline import Pipeline

d_tree = DecisionTreeClassifier()

# pipeline: model-based selection followed by a decision tree classifier
select_from_pipe = Pipeline([('select', SelectFromModel(DecisionTreeClassifier())),
                             ('classifier', d_tree)])

select_from_pipe_params = deepcopy(tree_pipe_params)
select_from_pipe_params.update({
    'select__threshold': [.01, .05, .1, "mean", "median", "2.*mean"],
    'select__estimator__max_depth': [None, 1, 3, 5, 7]
})
print(select_from_pipe_params)
# not better than the original feature set
get_best_model_and_accuracy(select_from_pipe,
                            select_from_pipe_params,
                            X, y)
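get_best_model_and_accuracy is a helper defined earlier in the book; a rough sketch of what such a helper could look like with GridSearchCV (the exact metrics and output format here are assumptions, not the book's code):

from sklearn.model_selection import GridSearchCV

def get_best_model_and_accuracy(model, params, X, y):
    # hypothetical re-implementation: grid-search the estimator and report results
    grid = GridSearchCV(model, params, error_score=0.)  # failed fits score 0
    grid.fit(X, y)
    print("Best Accuracy: {}".format(grid.best_score_))
    print("Best Parameters: {}".format(grid.best_params_))
    print("Average Time to Fit (s): {}".format(round(grid.cv_results_['mean_fit_time'].mean(), 3)))
    print("Average Time to Score (s): {}".format(round(grid.cv_results_['mean_score_time'].mean(), 3)))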