1. Import the required libraries
First, import the required libraries and modules:
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
import pandas as pd
import numpy as np
import torch
2. Reading and preprocessing the data
Read the training and test sets:
# The raw files are space-separated
train_data = pd.read_csv('used_car_train_20200313.csv', sep=' ')
test_data = pd.read_csv('used_car_testB_20200421.csv', sep=' ')
# Save comma-separated copies
test_data.to_csv('used_car_testB.csv')
train_data.to_csv('used_car_train.csv')
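As a quick illustration of what `sep=' '` does, the sketch below parses a small space-separated string instead of the real file. The sample rows and the variable names `raw` and `sample` are made up for the example; only the column names come from the code in this post.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same space-separated layout
raw = "SaleID name price\n0 736 1850\n1 2262 3600\n"

# sep=' ' tells pandas to split fields on single spaces instead of commas
sample = pd.read_csv(io.StringIO(raw), sep=' ')
print(sample.dtypes)
```

If `sep` were left at its default, the whole header would be read as a single column name, so it is worth checking `sample.shape` after loading.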
Concatenate the training and test sets so they can be preprocessed together:
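data = pd.concat([train_data, test_data])
# '-' marks a missing value; encode it as -1 so the column can be cast to float
data = data.replace('-', '-1')
data.notRepairedDamage = data.notRepairedDamage.astype('float32')
# Cap implausibly large power values at 600
data.loc[data['power'] > 600, 'power'] = 600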
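The merge-and-clean step can be tried on a toy example. The sketch below assumes two tiny hand-made frames (`toy_train` and `toy_test` are hypothetical stand-ins for the real CSVs) and uses `Series.clip` as an equivalent way to cap `power`; it is not the code used above, just a compact variant of the same idea.

```python
import pandas as pd

# Hypothetical toy frames standing in for the real train/test data
toy_train = pd.DataFrame({
    'SaleID': [0, 1, 2],
    'power': [75, 1200, 110],
    'notRepairedDamage': ['0.0', '-', '1.0'],
    'price': [1850, 3600, 6222],
})
toy_test = pd.DataFrame({
    'SaleID': [3, 4],
    'power': [650, 90],
    'notRepairedDamage': ['-', '0.0'],
})

# The test frame has no price column, so its rows get NaN prices after concat
toy = pd.concat([toy_train, toy_test], ignore_index=True)

# Replace the '-' placeholder, then cast the column to float
toy['notRepairedDamage'] = toy['notRepairedDamage'].replace('-', '-1').astype('float32')

# Cap power at 600 (same effect as the .loc assignment)
toy['power'] = toy['power'].clip(upper=600)

n_test_rows = int(toy['price'].isna().sum())
print(toy)
```

The NaN prices are what later lets the test rows be separated from the training rows again.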
Define the categorical and numerical features:
cate_cols = ['model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'seller', 'notRepairedDamage']
num_cols = ['regDate', 'creatDate', 'power', 'kilometer',
            'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7',
            'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14']
Define a helper function that one-hot encodes the categorical features:
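def oneHotEncode(df, colNames):
    for col in colNames:
        # One indicator column per distinct value, named '<col>_<value>'
        dummies = pd.get_dummies(df[col], prefix=col)
        df = pd.concat([df, dummies], axis=1)
        # Drop the original column once it has been encoded
        df.drop([col], axis=1, inplace=True)
    return df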
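To see what one-hot encoding produces, here is a minimal sketch on a made-up frame. It calls `pd.get_dummies` with its `columns` argument, which encodes several columns in one call; this is an alternative to the loop above, not the function defined in this post. The frame `toy` and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frame with two categorical columns and one numerical column
toy = pd.DataFrame({
    'gearbox': ['0.0', '1.0', '-1', '0.0'],
    'fuelType': ['0.0', '0.0', '1.0', '-1'],
    'power': [75, 110, 90, 150],
})

# Encode both categorical columns at once; dtype=float yields 0.0/1.0 indicators
encoded = pd.get_dummies(toy, columns=['gearbox', 'fuelType'], dtype=float)
print(encoded.columns.tolist())
```

Each categorical column is replaced by one indicator column per distinct value, so the column count grows with the number of categories.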
Process the categorical (discrete) and numerical (continuous) features:
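# Categorical columns: fill missing values, then one-hot encode
for col in cate_cols:
    data[col] = data[col].fillna('-1')
data = oneHotEncode(data, cate_cols)
# Numerical columns: fill missing values with 0, then min-max scale to [0, 1]
for col in num_cols:
    data[col] = data[col].fillna(0)
    data[col] = (data[col] - data[col].min()) / (data[col].max() - data[col].min())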
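The scaling applied to each numerical column is min-max normalization, x' = (x - min) / (max - min). A minimal sketch of that step as a standalone function follows; the helper name `min_max_scale` and the toy frame are hypothetical, and the guard for constant columns is an addition that the loop above does not have (there, a constant column would divide by zero and produce NaN).

```python
import pandas as pd

def min_max_scale(series: pd.Series) -> pd.Series:
    """Fill missing values with 0, then scale to [0, 1]."""
    filled = series.fillna(0)
    lo, hi = filled.min(), filled.max()
    span = hi - lo
    if span == 0:
        # Constant column: return zeros instead of dividing by zero
        return pd.Series(0.0, index=series.index)
    return (filled - lo) / span

# Hypothetical toy data: one column with a missing value, one constant column
toy = pd.DataFrame({'kilometer': [5.0, 15.0, None, 10.0], 'seller': [0, 0, 0, 0]})
scaled = toy.apply(min_max_scale)
print(scaled)
```

Note that filling with 0 before scaling means the missing entry becomes the column minimum here, which is also how the loop above behaves.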
Drop the columns that are not used as features:
data.drop(['name', 'regionCode'], axis=1, inplace=True)
data = data.reset_index(drop=True)
data = data.astype(float)
3. Splitting the dataset
Extract the test set (the rows whose price is missing):
test_data = data[pd.isna(data.price)]
# Keep the IDs for the submission file, then drop the non-feature columns
X_id = test_data['SaleID']
del test_data['SaleID']
del test_data['price']
X_result = torch.tensor(test_data.values, dtype=torch.float32)
test_data.to_csv('one_hot_testB.csv')
Extract the training set:
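train_data = data.drop(data[pd.isna(data.price)].index)
train_data.to_csv('one_hot_train.csv')
# Separate the target from the features
y = train_data['price']
del train_data['price']
del train_data['SaleID']
X = torch.tensor(train_data.values, dtype=torch.float32)
# Build the target tensor from the underlying NumPy array
y = torch.tensor(y.values, dtype=torch.float32)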
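The separation relies on test rows having no price. A minimal sketch of the same idea with a boolean mask is shown below; the toy frame and the names `is_test`, `test_ids`, and `y_toy` are made up for illustration, and tensors are left out to keep the example small.

```python
import numpy as np
import pandas as pd

# Hypothetical merged frame: the last two rows came from the test set
toy = pd.DataFrame({
    'SaleID': [0.0, 1.0, 2.0, 3.0],
    'power': [0.1, 0.9, 0.4, 0.7],
    'price': [1850.0, 3600.0, np.nan, np.nan],
})

# One mask splits the frame both ways
is_test = toy['price'].isna()
toy_test = toy[is_test].copy()
toy_train = toy[~is_test].copy()

# pop() removes a column and returns it in one step
test_ids = toy_test.pop('SaleID').astype(int)
toy_test = toy_test.drop(columns=['price'])
y_toy = toy_train.pop('price')
toy_train = toy_train.drop(columns=['SaleID'])

print(toy_train.columns.tolist(), toy_test.columns.tolist())
```

After the split, both frames must have exactly the same feature columns in the same order, otherwise the model would see shifted features at prediction time.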
Split the labeled data into a training set and a held-out test set:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=512)
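For reference, the sketch below runs `train_test_split` with the same `test_size` and `random_state` on small NumPy arrays, to show the resulting sizes and that features and targets stay aligned. The arrays `X_toy` and `y_toy` are invented for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: row i is [2i, 2i + 1] and its target is i
X_toy = np.arange(40, dtype=np.float32).reshape(20, 2)
y_toy = np.arange(20, dtype=np.float32)

# 25% of the rows are held out; random_state makes the shuffle reproducible
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=512
)
print(X_tr.shape, X_val.shape)
```

Because rows and targets are shuffled together, the first feature of every training row is still twice its target.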
4. Model training and evaluation
Train a random forest regressor and a linear regression model:
lr1 = RandomForestRegressor().fit(X_train, y_train)
lr2 = LinearRegression().fit(X_train, y_train)
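Print the random forest's score on the training and test sets (for regressors, score returns the coefficient of determination R²):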
print('Training set score: {:.3f}'.format(lr1.score(X_train, y_train)))
print('Test set score: {:.3f}'.format(lr1.score(X_test, y_test)))
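The code above only reports the score of `lr1`. A minimal sketch of evaluating both model types side by side is given below. It runs on synthetic data generated on the spot (not the used-car dataset), and the choice to report mean absolute error next to R² is an assumption about what is useful here, not something taken from this post.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 1, size=(300, 3))
y_syn = 5000 * X_syn[:, 0] + 2000 * X_syn[:, 1] ** 2 + rng.normal(0, 100, size=300)

X_tr, X_val, y_tr, y_val = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=512
)

models = {
    'random forest': RandomForestRegressor(n_estimators=50, random_state=0),
    'linear regression': LinearRegression(),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_val)
    # r2_score matches what model.score() reports for regressors
    results[name] = {
        'r2': r2_score(y_val, pred),
        'mae': mean_absolute_error(y_val, pred),
    }
    print('{}: R2 = {:.3f}, MAE = {:.1f}'.format(
        name, results[name]['r2'], results[name]['mae']))
```

A large gap between the training and held-out scores of the random forest would suggest overfitting, which is why both numbers are worth printing.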
5. Saving the results
Save the predictions to a file:
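# `res` was not defined in the original code; assumed here to hold lr1's predictions on X_result
res = pd.DataFrame({'price': lr1.predict(X_result)}, index=X_id.index)
submission = pd.concat([X_id.astype(int), res['price']], axis=1)
submission.to_csv('submission.csv', index=False)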
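One detail worth checking when building the submission is index alignment: `pd.concat(..., axis=1)` matches rows by index label, not by position. The sketch below uses made-up IDs and predictions (`ids` and `pred` are hypothetical) to show the pitfall and one way around it.

```python
import numpy as np
import pandas as pd

# Hypothetical IDs whose index does not start at 0, as happens after filtering rows
ids = pd.Series([150000.0, 150001.0, 150002.0], index=[7, 8, 9], name='SaleID')
# Hypothetical predictions as a plain array (what predict() returns)
pred = np.array([1234.5, 980.0, 15000.25])

# Pitfall: a new Series gets index 0..2, so concat cannot line the rows up
misaligned = pd.concat([ids, pd.Series(pred, name='price')], axis=1)

# Building the frame from raw arrays pairs values purely by position
submission = pd.DataFrame({
    'SaleID': ids.astype(int).to_numpy(),
    'price': pred,
})

# to_csv without a path returns the CSV text, handy for a quick look
csv_text = submission.to_csv(index=False)
print(csv_text)
```

The misaligned frame ends up with six rows, half of them missing an ID and half missing a price, which is easy to overlook until the file is rejected.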