Used-Car Price Prediction Model Notes (Using Libraries)

1. Importing the required libraries

First, import the required libraries and modules:

%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
import pandas as pd
import numpy as np
import torch

2. Reading and preprocessing the data

Read the training and test sets:

train_data = pd.read_csv('used_car_train_20200313.csv', sep=' ')
test_data = pd.read_csv('used_car_testB_20200421.csv', sep=' ')
test_data.to_csv('used_car_testB.csv')
train_data.to_csv('used_car_train.csv')

Merge the training and test sets so they can be processed together:

data = pd.concat([train_data, test_data])
data = data.replace('-', '-1')
data.notRepairedDamage = data.notRepairedDamage.astype('float32')
data.loc[data['power'] > 600, 'power'] = 600
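The two cleaning steps above can be tried on a toy frame. This is a minimal sketch with made-up values, not the competition data; it uses clip(upper=600), which has the same effect as the .loc assignment for capping power at 600. Note that DataFrame.replace('-', '-1') matches whole cell values only, so it does not touch strings that merely contain a hyphen.

```python
import pandas as pd

# Toy rows with made-up values; '-' marks a missing damage flag
toy = pd.DataFrame({
    "notRepairedDamage": ["0.0", "-", "1.0"],
    "power": [75, 19312, 600],
})

# Replace the '-' placeholder, then convert the column to a numeric type
toy = toy.replace("-", "-1")
toy["notRepairedDamage"] = toy["notRepairedDamage"].astype("float32")

# Cap implausibly large power values at 600
toy["power"] = toy["power"].clip(upper=600)

print(toy)
```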

Define the categorical and numeric features:

cate_cols = ['model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'seller', 'notRepairedDamage']
num_cols = ['regDate', 'creatDate', 'power', 'kilometer',
            'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7',
            'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14']

Define a one-hot encoding function for the categorical features:

def oneHotEncode(df, colNames):
    for col in colNames:
        dummies = pd.get_dummies(df[col], prefix=col)
        df = pd.concat([df, dummies], axis=1)
        df.drop([col], axis=1, inplace=True)
    return df
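To see what the encoder produces, here is a small sketch on a toy frame with made-up values. The function one_hot_encode is an illustrative variant of the function above, not the original, and it does not modify its input. The last lines compare it with pd.get_dummies(df, columns=...), which appears to give the same result in a single call.

```python
import pandas as pd

def one_hot_encode(df, col_names):
    """Replace each column in col_names with its indicator (dummy) columns."""
    for col in col_names:
        dummies = pd.get_dummies(df[col], prefix=col)
        df = pd.concat([df.drop(columns=[col]), dummies], axis=1)
    return df

# Toy frame with made-up values, for illustration only
toy = pd.DataFrame({"gearbox": [0.0, 1.0, 0.0], "power": [60, 110, 75]})

encoded = one_hot_encode(toy, ["gearbox"])
print(list(encoded.columns))

# Built-in shortcut: encode the listed columns in one call
builtin = pd.get_dummies(toy, columns=["gearbox"])
print(list(builtin.columns))
```

Each distinct value of gearbox becomes its own indicator column, so the number of columns grows with the number of categories in model, brand and the other categorical features.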

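Process the discrete (categorical) and continuous (numeric) data: fill missing values, one-hot encode the categorical columns, and min-max scale the numeric columns to [0, 1]: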

for col in cate_cols:
    data[col] = data[col].fillna('-1')
data = oneHotEncode(data, cate_cols)
for col in num_cols:
    data[col] = data[col].fillna(0)
    data[col] = (data[col] - data[col].min()) / (data[col].max() - data[col].min())
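The numeric step above is min-max scaling: each value x becomes (x - min) / (max - min), so the column ends up in [0, 1]. A minimal sketch follows, with made-up values. The helper min_max_scale is a hypothetical name, and the guard for constant columns is an addition that the original loop does not have (there, a constant column would divide by zero and yield NaN).

```python
import pandas as pd

def min_max_scale(col: pd.Series) -> pd.Series:
    """Rescale a numeric column to [0, 1]; a constant column maps to all zeros."""
    span = col.max() - col.min()
    if span == 0:
        # Assumption: treat a constant column as carrying no information
        return pd.Series(0.0, index=col.index)
    return (col - col.min()) / span

km = pd.Series([5.0, 10.0, 15.0])        # made-up kilometer values
print(min_max_scale(km).tolist())        # [0.0, 0.5, 1.0]

constant = pd.Series([3.0, 3.0])
print(min_max_scale(constant).tolist())  # [0.0, 0.0]
```

In the notes, the minimum and maximum are computed over the merged training and test data, so both sets share the same scale.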

Drop the irrelevant columns:

data.drop(['name', 'regionCode'], axis=1, inplace=True)
data = data.reset_index(drop=True)
data = data.astype(float)

3. Separating the datasets

Extract the test set (the rows whose price is missing):

# .copy() avoids modifying a view of data when columns are deleted below
test_data = data[pd.isna(data.price)].copy()
X_id = test_data['SaleID']
del test_data['SaleID']
del test_data['price']
X_result = torch.tensor(test_data.values, dtype=torch.float32)
test_data.to_csv('one_hot_testB.csv')

Extract the training set (the rows that have a price):

train_data = data.drop(data[pd.isna(data.price)].index)
train_data.to_csv('one_hot_train.csv')
y = train_data['price']
del train_data['price']
del train_data['SaleID']
X = torch.tensor(train_data.values, dtype=torch.float32)
# Build the target tensor from the underlying NumPy array
y = torch.tensor(y.values, dtype=torch.float32)
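The separation relies on one fact: after merging, only the test rows have a missing price. A minimal sketch of the same idea on a toy frame with made-up values, using a boolean mask instead of dropping by index:

```python
import numpy as np
import pandas as pd

# Toy merged frame: the last two rows play the role of the test set
combined = pd.DataFrame({
    "SaleID": [0, 1, 2, 3],
    "power": [0.1, 0.4, 0.9, 0.3],
    "price": [1850.0, 3600.0, np.nan, np.nan],
})

is_test = combined["price"].isna()

# Rows without a price are the ones to predict
test_part = combined[is_test].drop(columns=["price"])
# Rows with a price are used for fitting
train_part = combined[~is_test]

print(test_part["SaleID"].tolist(), train_part["SaleID"].tolist())
```

This only works if the training file itself contains no missing prices; otherwise those rows would be treated as test rows.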

Split the labeled data into a training set and a held-out test set (25% of the rows):

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=512)

4. Model training and evaluation

Train a random forest regressor and a linear regression model:

lr1 = RandomForestRegressor().fit(X_train, y_train)
lr2 = LinearRegression().fit(X_train, y_train)

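Print the random forest model's scores on the training set and the held-out test set (for scikit-learn regressors, score returns the coefficient of determination R²):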

print('Training set score: {:.3f}'.format(lr1.score(X_train, y_train)))
print('Test set score: {:.3f}'.format(lr1.score(X_test, y_test)))
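The value printed by score for a scikit-learn regressor is R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the targets from their mean. A minimal NumPy sketch with made-up prices and predictions shows the arithmetic; r2_score_manual is a hypothetical helper name, not part of any library.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot

y_true = [1000.0, 2000.0, 3000.0]   # made-up prices
y_pred = [1100.0, 1900.0, 3000.0]   # made-up predictions

print(round(r2_score_manual(y_true, y_pred), 3))  # 0.99
```

A score of 1.0 means perfect predictions, and a score of 0 is no better than always predicting the mean price. A training score far above the test score suggests overfitting.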

5. Saving the results

Save the results to a file:

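# res is not defined earlier in these notes; it is assumed here to hold
# the random forest's predictions on X_result, indexed like X_id
res = pd.DataFrame({'price': lr1.predict(X_result)}, index=X_id.index)
submission = pd.concat([X_id, res['price']], axis=1)
submission.to_csv('submission.csv', index=False)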
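One thing to watch when building the submission: pd.concat with axis=1 aligns rows by index label, not by position. Because the test rows sit after the training rows in the merged frame, X_id keeps index labels that do not start at 0, while a freshly built prediction column starts at 0. The sketch below uses made-up IDs and prices to show the mismatch and one way to avoid it.

```python
import pandas as pd

# IDs carrying non-zero-based index labels, as left over from a merged frame
ids = pd.Series([150000, 150001], index=[150000, 150001], name="SaleID")
# Predictions with a default 0-based index (made-up values)
preds = pd.Series([1200.0, 5400.0], name="price")

# Index labels differ, so the rows do not line up
misaligned = pd.concat([ids, preds], axis=1)

# Resetting the index makes both inputs align by position
aligned = pd.concat([ids.reset_index(drop=True), preds], axis=1)

print(misaligned.shape, aligned.shape)  # (4, 2) (2, 2)
```

Checking the shape of the submission and confirming it contains no NaN values before writing it is a cheap safeguard.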
