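We use a random forest to build a regression model for bike-sharing demand. The walkthrough covers three steps: splitting the dataset, building date features, and imputing missing values.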
1 Model 1: baseline
We start with the most basic model and apply no feature engineering.
import numpy as np
import pandas as pd
from datetime import datetime
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
np.random.seed(123)
1.1 Load the data, separate features from labels, and split into training and test sets
data = pd.read_csv("train.csv")
x = data[["season", "holiday", "workingday", "weather", "temp", "atemp","humidity","windspeed"]]
y = data["count"]
X_train, X_test, y_train, y_test = train_test_split(x, y,
train_size=0.7,
shuffle=False)
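Because `shuffle=False` is passed, `train_test_split` keeps the rows in file order: the leading 70% become the training set and the trailing 30% the test set. The sketch below illustrates this on a small made-up frame; the toy values are placeholders, not the real dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for train.csv; the values are invented for illustration.
toy = pd.DataFrame({"temp": range(10), "count": range(100, 110)})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy[["temp"]], toy["count"], train_size=0.7, shuffle=False
)

# With shuffle=False the split is a single cut: every training row
# comes before every test row in the original order.
train_index = list(X_tr.index)
test_index = list(X_te.index)
print("train rows:", train_index)
print("test rows:", test_index)
```

If the file is sorted by time, this amounts to training on earlier records and testing on later ones.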
1.2 Train and validate the model with a random forest
rf = RandomForestRegressor(n_estimators=100)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print("mse:{}".format(mean_squared_error(y_test, y_pred)))
## mse:43578.72115559897
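For reference, the value reported by `mean_squared_error` is the mean of the squared differences between the true and predicted counts. Taking its square root (RMSE) expresses the error in the same unit as the target, which is often easier to interpret. A minimal sketch follows; the helper names `mse` and `rmse` and the sample values are made up for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def rmse(y_true, y_pred):
    """Square root of the MSE, in the same unit as the target."""
    return float(np.sqrt(mse(y_true, y_pred)))

# Invented values: residuals are -2, 2, -3, so MSE = (4 + 4 + 9) / 3.
example_mse = mse([10, 20, 30], [12, 18, 33])
example_rmse = rmse([10, 20, 30], [12, 18, 33])
print(example_mse, example_rmse)
```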
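2 Model 2: adding time features
In this section we reload the data and split the timestamp into separate features: year, month, hour, and weekday.
2.1 Load the data and build the features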
data = pd.read_csv("train.csv")
data["date"] = data.datetime.apply(lambda x: x.split()[0])
data["hour"] = data.datetime.apply(lambda x: x.split()[1].split(":")[0]).astype("int")
data["year"] = data.datetime.apply(lambda x: x.split()[0].split("-")[0]).astype("int")
data["weekday"] = data.date.apply(lambda dateString: datetime.strptime(dateString, "%Y-%m-%d").weekday())
data["month"] = data.date.apply(lambda dateString: datetime.strptime(dateString, "%Y-%m-%d").month)
x = data[["hour","year","weekday","month","season", "holiday", "workingday", "weather", "temp", "atemp","humidity","windspeed"]]
y = data["count"]
X_train, X_test, y_train, y_test = train_test_split(x, y,
train_size=0.7,
shuffle=False)
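The same time features can also be derived without string splitting by parsing the column once with `pd.to_datetime` and reading the parts through the `.dt` accessor. This is an alternative sketch rather than the code used above, and the two timestamps are invented, written in the `YYYY-MM-DD HH:MM:SS` layout that the string handling above implies.

```python
import pandas as pd

# Two invented timestamps for illustration.
df = pd.DataFrame({"datetime": ["2011-01-01 05:00:00", "2012-12-19 23:00:00"]})

ts = pd.to_datetime(df["datetime"])
df["hour"] = ts.dt.hour
df["year"] = ts.dt.year
df["weekday"] = ts.dt.weekday  # Monday=0 ... Sunday=6, like datetime.weekday()
df["month"] = ts.dt.month

print(df)
```

All four columns come out as integers, so no extra `astype` call is needed.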
2.2 Train and validate the model with a random forest
rf = RandomForestRegressor(n_estimators=100,random_state=123)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print("mse:{}".format(mean_squared_error(y_test, y_pred)))
## mse:5866.200720560327
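The result shows that adding the time features improves the model substantially: the test MSE drops from about 43,579 to about 5,866.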
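3 Model 3: imputing windspeed
Data analysis shows that the windspeed column contains many zero values, so we use a model to impute them. We split the rows into those with a non-zero windspeed (used for training) and those with a zero windspeed (to be predicted), pick a subset of the columns as input features, and use windspeed as the target.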
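A quick way to check how common the zero readings are is to count them with a boolean mask. The sketch below uses invented readings; with the real data the same expressions would be applied to `data["windspeed"]`.

```python
import pandas as pd

# Invented readings for illustration only.
toy = pd.DataFrame({"windspeed": [0.0, 6.0, 0.0, 12.9, 0.0, 8.9]})

zero_mask = toy["windspeed"] == 0
n_zero = int(zero_mask.sum())         # number of zero readings
share_zero = float(zero_mask.mean())  # fraction of rows that are zero

print(f"{n_zero} zero readings ({share_zero:.0%} of rows)")
```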
3.1 Load the data and add the time features
data = pd.read_csv("train.csv")
data["date"] = data.datetime.apply(lambda x: x.split()[0])
data["hour"] = data.datetime.apply(lambda x: x.split()[1].split(":")[0]).astype("int")
data["year"] = data.datetime.apply(lambda x: x.split()[0].split("-")[0]).astype("int")
data["weekday"] = data.date.apply(lambda dateString: datetime.strptime(dateString, "%Y-%m-%d").weekday())
data["month"] = data.date.apply(lambda dateString: datetime.strptime(dateString, "%Y-%m-%d").month)
3.2 Use a random forest to impute the missing values in the windspeed column
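# Zero readings are treated as missing values
# .copy() avoids pandas' SettingWithCopyWarning when the column is overwritten below
dataWind0 = data[data["windspeed"] == 0].copy()
dataWindNot0 = data[data["windspeed"] != 0]
rfModel_wind = RandomForestRegressor()
windColumns = ["season", "weather", "humidity", "month", "temp", "year", "atemp"]
# Fit on the rows with a known windspeed, then predict the rows where it is zero
rfModel_wind.fit(dataWindNot0[windColumns], dataWindNot0["windspeed"])
wind0Values = rfModel_wind.predict(X=dataWind0[windColumns])
dataWind0["windspeed"] = wind0Values
# DataFrame.append was removed in pandas 2.0; pd.concat with ignore_index=True
# also replaces the reset_index / drop('index') steps
data = pd.concat([dataWindNot0, dataWind0], ignore_index=True)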
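The step above moves the imputed rows to the end of the frame. As an alternative sketch, not the approach used in this article, the predictions can be written back through a boolean mask so that every row keeps its original position. The function name `impute_zero_windspeed` and the synthetic data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def impute_zero_windspeed(df, feature_cols, random_state=123):
    """Return a copy of df with zero windspeed values replaced by predictions."""
    out = df.copy()
    out["windspeed"] = out["windspeed"].astype(float)
    zero = out["windspeed"] == 0
    if zero.any() and (~zero).any():
        model = RandomForestRegressor(n_estimators=100, random_state=random_state)
        model.fit(out.loc[~zero, feature_cols], out.loc[~zero, "windspeed"])
        # Assign through the mask so each row stays at its original position.
        out.loc[zero, "windspeed"] = model.predict(out.loc[zero, feature_cols])
    return out

# Synthetic stand-in for the real data (all values are invented).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "temp": rng.uniform(0, 35, 200),
    "humidity": rng.uniform(20, 100, 200),
})
toy["windspeed"] = 5 + 0.3 * toy["temp"] + rng.normal(0, 1, 200)
toy.loc[::10, "windspeed"] = 0.0  # pretend every 10th reading is missing

filled = impute_zero_windspeed(toy, ["temp", "humidity"])
print((filled["windspeed"] == 0).sum(), "zero readings remain")
```

Keeping the row order means a later split with `shuffle=False` covers the same rows as before the imputation, which makes scores easier to compare across models.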
3.3 Split the dataset
x = data[["hour","year","weekday","month","season", "holiday", "workingday", "weather", "temp", "atemp","humidity","windspeed"]]
y = data["count"]
X_train, X_test, y_train, y_test = train_test_split(x, y,
train_size=0.7,
shuffle=False)
3.4 Train and validate the model with a random forest
rf = RandomForestRegressor(n_estimators=100,random_state=123)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print("mse:{}".format(mean_squared_error(y_test, y_pred)))
## mse:4048.247575719535
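The experiment shows that imputing the zero windspeed values improves the model's score: the test MSE drops from about 5,866 to about 4,048. Note that recombining the two subsets moves the imputed rows to the end of the data, so the unshuffled 70/30 split here covers different rows than in model 2, and the two scores are not strictly comparable.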