1. Remove outliers using the box-plot rule
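The box-plot rule can be sketched as below. This is a minimal illustration, not necessarily the exact procedure used here: the helper name `remove_outliers_iqr`, the whisker factor of 1.5, and the toy `power` values are all assumptions.

```python
import pandas as pd

def remove_outliers_iqr(df, col, scale=1.5):
    """Drop rows whose `col` value lies outside the box-plot whiskers.

    Hypothetical helper: `scale` is the whisker factor (1.5 is the usual
    box-plot default). Rows with a missing value in `col` are dropped too.
    """
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - scale * iqr
    upper = q3 + scale * iqr
    # Keep only the rows inside [lower, upper]
    mask = df[col].between(lower, upper)
    return df[mask].reset_index(drop=True)

# Toy data (made up for illustration): one extreme value
demo = pd.DataFrame({'power': [60, 75, 90, 110, 120, 150, 5000]})
cleaned = remove_outliers_iqr(demo, 'power')
print(cleaned)
```

Outliers should normally be removed from the training rows only, so that no test rows are discarded.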
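2. Feature construction
- data['creatDate'] - data['regDate']: derive a feature for how long the car has been in use
- regionCode -> city: derive a city feature from the postal code
- Derive per-brand sales statistics features from the car brand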
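Two of the constructions above can be sketched as follows. Several things here are assumptions: that the dates are stored as YYYYMMDD integers, that a `brand` and a `price` column exist, the feature name `used_time`, and the toy values. The `brand_*` column names follow the ones used in the feature-selection step. The regionCode -> city mapping is not sketched because the notes do not state the rule.

```python
import pandas as pd

# Toy data (made up); dates assumed to be YYYYMMDD integers
demo = pd.DataFrame({
    'regDate':   [20100101, 20150615, 20120301],
    'creatDate': [20160101, 20160615, 20160401],
    'brand':     [0, 0, 1],
    'price':     [3000, 5000, 8000],
})

# Usage time in days; unparseable dates become NaT and yield NaN
reg = pd.to_datetime(demo['regDate'].astype(str), format='%Y%m%d', errors='coerce')
created = pd.to_datetime(demo['creatDate'].astype(str), format='%Y%m%d', errors='coerce')
demo['used_time'] = (created - reg).dt.days

# Per-brand statistics of price, merged back onto every row
brand_stats = (
    demo.groupby('brand')['price']
        .agg(brand_amount='count',
             brand_price_average='mean',
             brand_price_max='max',
             brand_price_median='median')
        .reset_index()
)
demo = demo.merge(brand_stats, on='brand', how='left')
print(demo)
```

Because the brand statistics are computed from price, they should be fitted on the training rows only and then merged onto the test rows, to avoid leaking the target.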
3. Normalization
Log-transform power, then min-max normalize it
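import numpy as np
from sklearn import preprocessing
# Scaler instance; unused below because the min-max step is written out by hand
min_max_scaler = preprocessing.MinMaxScaler()
# log(1 + x) compresses large power values and stays defined at power == 0
data['power'] = np.log1p(data['power'])
# Min-max scale to [0, 1]
data['power'] = (data['power'] - data['power'].min()) / (data['power'].max() - data['power'].min())
# Inspect the resulting distribution
data['power'].plot.hist()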
Normalize kilometer
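The notes give no code for this step. A minimal sketch, assuming kilometer is min-max scaled in the same way as power but without the log transform; the helper `min_max` and the toy values are hypothetical.

```python
import pandas as pd

def min_max(series):
    """Scale a numeric Series to [0, 1] (hypothetical helper)."""
    lo, hi = series.min(), series.max()
    if hi == lo:
        # A constant column has no range; return zeros instead of dividing by 0
        return series * 0.0
    return (series - lo) / (hi - lo)

# Toy data (made up for illustration)
demo = pd.DataFrame({'kilometer': [0.5, 5.0, 12.5, 15.0]})
demo['kilometer'] = min_max(demo['kilometer'])
print(demo)
```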
4. Feature selection
Select features by correlation
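import matplotlib.pyplot as plt
import seaborn as sns

data_numeric = data[['power', 'kilometer', 'brand_amount', 'brand_price_average',
                     'brand_price_max', 'brand_price_median']]
# Pairwise Pearson correlation between the numeric features
# 'price' is not in this column list, so the matrix covers feature-to-feature correlations only
correlation = data_numeric.corr()
f, ax = plt.subplots(figsize=(7, 7))
plt.title('Correlation of Numeric Features with Price', y=1, size=16)
sns.heatmap(correlation, square=True, vmax=0.8, ax=ax)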
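To turn the correlations into an actual selection, one option is to rank each feature by its absolute correlation with the target. This is a sketch under assumptions: the target column is named `price`, the threshold of 0.5 is arbitrary, and the toy data (including the `noise` column) is made up.

```python
import pandas as pd

# Toy data (made up for illustration)
demo = pd.DataFrame({
    'price':     [1000, 2000, 3000, 4000, 5000],
    'power':     [50, 80, 110, 150, 200],
    'kilometer': [15.0, 12.5, 10.0, 6.0, 3.0],
    'noise':     [3, 1, 4, 1, 5],
})

# Pearson correlation of every feature with the target
corr_with_price = demo.corr()['price'].drop('price')

# Rank by absolute value: a strong negative correlation is as useful as a positive one
ranked = corr_with_price.abs().sort_values(ascending=False)

threshold = 0.5  # hypothetical cut-off, tune for the real data
selected = ranked[ranked >= threshold].index.tolist()
print(ranked)
print(selected)
```

Pearson correlation only captures linear relationships, so a feature that falls below the threshold may still be useful to a non-linear model.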