"Data and features determine the upper limit of machine learning; models and algorithms merely approximate that limit." Feature construction and feature selection are therefore crucial, and this post works through several methods for each.
I. Feature Construction
# Combine transit stations
data['bus_sub_num'] = data['subwayStationNum'] + data['busStationNum']
# Combine schools
data['school_num'] = data['interSchoolNum'] + data['schoolNum'] + data['privateSchoolNum']
# Combine medical facilities
data['help_sum'] = data['hospitalNum'] + data['drugStoreNum']
# Combine leisure facilities
data['play_sum'] = data['gymNum'] + data['parkNum'] + data['bankNum']
# Combine shopping facilities
data['shop_num'] = data['shopNum'] + data['mallNum'] + data['superMarketNum']
# Other combinations
data['totalNewTradeMoney_Workers'] = data['totalNewTradeMoney'] + data['totalWorkers']
data['bankNum_Workers'] = data['bankNum'] + data['totalWorkers']
data['gym_bankNum'] = data['bankNum'] + data['gymNum']
# Block-level mean price of second-hand housing
data['area_mean_price'] = (data['area'] * data['tradeMeanPrice']) / 1000
# Block-level mean price of new housing
data['New_area_mean_price'] = (data['area'] * data['tradeNewMeanPrice']) / 1000
II. Feature Selection
1. Filter
(1) Information gain
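For a continuous target, information gain is usually estimated via mutual information. A minimal sketch using scikit-learn's `mutual_info_regression` on synthetic data (the arrays below are illustrative, not the competition data):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.RandomState(0)
X_demo = rng.rand(200, 3)                    # three synthetic features
y_demo = 5 * X_demo[:, 0] + rng.rand(200)    # target driven mainly by column 0
mi = mutual_info_regression(X_demo, y_demo, random_state=0)
print(mi.argmax())                           # column 0 scores highest
```

Features with near-zero mutual information carry little signal about the target and are candidates for dropping.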
(2) Correlation coefficient
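The correlation filter simply ranks features by their Pearson correlation with the target, which in pandas is one call to `DataFrame.corr`. A toy sketch (the frame below is made up, with `tradeMoney` standing in for the target):

```python
import pandas as pd

df = pd.DataFrame({
    'area':       [50, 70, 90, 110, 130],
    'bankNum':    [3, 1, 4, 1, 5],
    'tradeMoney': [2500, 3400, 4600, 5500, 6400],
})
# Pearson correlation of every feature with the target column
corr = df.corr()['tradeMoney'].drop('tradeMoney')
print(corr.sort_values(ascending=False))
```

Features whose correlation with the target is near zero are candidates for removal; pairs of features strongly correlated with each other are candidates for merging, as in the construction step above.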
(3) Chi-square test
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# chi2 requires non-negative feature values
selector = SelectKBest(chi2, k=43).fit(X, y)
selected_idx = selector.get_support(indices=True)  # indices of the 43 selected features
2. Wrapper
(1) Recursive feature elimination (RFE)
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

X = train_data.drop(["tradeMoney"], axis=1)
y = train_data["tradeMoney"]
lr = LinearRegression()
rfe = RFE(lr, n_features_to_select=40)
rfe.fit(X, y)
print(rfe.ranking_, rfe.n_features_, rfe.support_)
# Keep the columns RFE marked as selected
sel_features = [f for f, s in zip(X.columns, rfe.support_) if s]
3. Embedded
(1) Penalty-based feature selection
from sklearn.linear_model import Ridge

ridge = Ridge(alpha=5)
ridge.fit(X, y)
# Sort the coefficients and inspect the feature ordering
coefSort = ridge.coef_.argsort()
featureCoefSore = ridge.coef_[coefSort]
print(X.columns[coefSort])
# Keep features whose coefficient magnitude exceeds 2
# (pair each column with its own coefficient, not the sorted array)
sel_features = [f for f, c in zip(X.columns, ridge.coef_) if abs(c) > 2]
train = train_data[sel_features]
test = test_data[sel_features]
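Note that Ridge's L2 penalty shrinks coefficients but rarely makes them exactly zero, which is why a magnitude cutoff is needed above. The L1 penalty (Lasso) drives weak coefficients exactly to zero, making it the more typical penalty-based selector. A sketch on synthetic data (the column names `f0`–`f3` and the `alpha` value are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

rng = np.random.RandomState(42)
X_demo = pd.DataFrame(rng.rand(100, 4), columns=['f0', 'f1', 'f2', 'f3'])
# Only f0 and f1 actually influence the target
y_demo = 10 * X_demo['f0'] - 8 * X_demo['f1'] + 0.1 * rng.rand(100)

lasso = Lasso(alpha=0.1).fit(X_demo, y_demo)
# The L1 penalty zeroes the uninformative coefficients, so no
# hand-tuned magnitude threshold is required
sel = [f for f, c in zip(X_demo.columns, lasso.coef_) if abs(c) > 1e-6]
print(sel)
```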
(2) Tree-model-based feature selection
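Tree-model-based selection is commonly wrapped in `sklearn.feature_selection.SelectFromModel`, which fits the model and keeps the features whose importance clears a threshold. A minimal sketch on synthetic data (the model choice and `threshold='median'` are illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SelectFromModel

rng = np.random.RandomState(0)
X_demo = rng.rand(200, 5)
# Only the first two columns drive the target
y_demo = 3 * X_demo[:, 0] + 2 * X_demo[:, 1] + 0.01 * rng.rand(200)

# Fit the tree model and keep features with above-median importance
sfm = SelectFromModel(GradientBoostingRegressor(random_state=0),
                      threshold='median').fit(X_demo, y_demo)
print(sfm.get_support(indices=True))  # indices of the retained columns
X_reduced = sfm.transform(X_demo)
```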
(3) Random forest: mean decrease impurity
from sklearn.ensemble import RandomForestRegressor

# Train a random forest, then read each feature's importance score
# from the feature_importances_ attribute
rf = RandomForestRegressor()
rf.fit(X, y)
print("Features sorted by their score:")
print(sorted(zip(map(lambda x: round(x, 4), rf.feature_importances_), X.columns),
             reverse=True))
# Keep features whose importance score exceeds 0.001
sel_features = [f for f, s in zip(X.columns, rf.feature_importances_) if s > 0.001]