[Notes] [Machine Learning Basics] Expert Knowledge

'VeNus

Published 2022-06-11 18:46:35

Category: Reading Notes. Tags: machine learning, python, deep learning

Article link: https://blog.csdn.net/qq_47809408/article/details/125113057

A special case: predicting bike rentals in front of Andreas's house

(1) Load the data and resample it

import mglearn

citibike = mglearn.datasets.load_citibike()

print("Citi Bike data:\n{}".format(citibike.head()))

Citi Bike data:
starttime
2015-08-01 00:00:00 3
2015-08-01 03:00:00 0
2015-08-01 06:00:00 9
2015-08-01 09:00:00 41
2015-08-01 12:00:00 39
Freq: 3H, Name: one, dtype: int64

(2) Visualize the number of rentals over the whole month

import matplotlib.pyplot as plt
import pandas as pd

plt.figure(figsize=(10, 3))
xticks = pd.date_range(start=citibike.index.min(), end=citibike.index.max(),
                       freq='D')
plt.xticks(xticks, xticks.strftime("%a %m-%d"), rotation=90, ha="left")
plt.plot(citibike, linewidth=1)
plt.xlabel("Date")
plt.ylabel("Rentals")

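When predicting on a time series, we want to learn from the past and predict the future.
When splitting the data into a training set and a test set, we therefore use all data before a certain date as the training set and all data after that date as the test set.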
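
A minimal sketch of such a date-based split, assuming a synthetic 3-hourly series in place of the Citi Bike data. The cutoff date and the names `series`, `train` and `test` are illustrative and do not come from the original notes:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the rental counts (assumption): one value every
# 3 hours for August 2015, i.e. 31 days * 8 windows = 248 points.
index = pd.date_range("2015-08-01", periods=248, freq=pd.Timedelta(hours=3))
rng = np.random.RandomState(0)
series = pd.Series(rng.poisson(lam=10, size=len(index)), index=index)

# Everything strictly before the cutoff is training data, the rest is test data.
cutoff = pd.Timestamp("2015-08-24")  # illustrative cutoff date
train = series[series.index < cutoff]
test = series[series.index >= cutoff]

print("train:", train.index.min(), "to", train.index.max(), len(train))
print("test: ", test.index.min(), "to", test.index.max(), len(test))
```

With this cutoff the first 23 days (184 points) fall into the training set, which corresponds to a positional split at n_train = 184.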

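(3) Use a single integer feature as the data representation

# target: number of rentals in each 3-hour window
y = citibike.values
# feature: POSIX time (nanoseconds since the epoch converted to whole seconds)
X = citibike.index.astype("int64").values.reshape(-1, 1) // 10**9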
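
To make the conversion concrete, here is a small sketch on three illustrative timestamps (not the real data set). The name `X_demo` is made up for this example, and the explicit cast to nanosecond resolution is a defensive assumption, because newer pandas releases can store timestamps at other resolutions:

```python
import pandas as pd

# Three illustrative timestamps, 3 hours apart.
index = pd.date_range("2015-08-01", periods=3, freq=pd.Timedelta(hours=3))

# Force nanosecond resolution, then read the raw integers: nanoseconds
# since 1970-01-01. Integer division by 10**9 gives whole seconds.
nanoseconds = index.astype("datetime64[ns]").astype("int64")
X_demo = nanoseconds.values.reshape(-1, 1) // 10**9
print(X_demo.ravel())

# Converting the seconds back recovers the original timestamps.
roundtrip = pd.to_datetime(X_demo.ravel(), unit="s")
print((roundtrip == index).all())
```

Consecutive rows differ by 10800 seconds, i.e. one 3-hour window.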

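(4) Define a function that splits the data, builds the model and visualizes the result

# use the first 184 data points (23 days) for training and the rest for testing
n_train = 184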

def eval_on_features(features, target, regressor):
    # split the given features into a training and a test set
    X_train, X_test = features[:n_train], features[n_train:]
    # also split the target array
    y_train, y_test = target[:n_train], target[n_train:]
    regressor.fit(X_train, y_train)
    print("Test-set R^2: {:.2f}".format(regressor.score(X_test, y_test)))
    y_pred = regressor.predict(X_test)
    y_pred_train = regressor.predict(X_train)
    plt.figure(figsize=(10, 3))

    # one tick per day (8 three-hour windows); xticks is the daily range from step (2)
    plt.xticks(range(0, len(features), 8), xticks.strftime("%a %m-%d"), rotation=90,
               ha="left")

    plt.plot(range(n_train), y_train, label="train")
    plt.plot(range(n_train, len(y_test) + n_train), y_test, '-', label="test")
    plt.plot(range(n_train), y_pred_train, '--', label="prediction train")

    plt.plot(range(n_train, len(y_test) + n_train), y_pred, '--',
             label="prediction test")
    plt.legend(loc=(1.01, 0))
    plt.xlabel("Date")
    plt.ylabel("Rentals")

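Random forests need very little data preprocessing, which makes them a good first model.
We use the POSIX time feature X and pass a random forest regressor to the eval_on_features function.

(5) Random forest predictions using only POSIX time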

from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=100, random_state=0)
eval_on_features(X, y, regressor)

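The predictions on the training set are quite good, as is usual for random forests. On the test set, however, the prediction is a constant line, which means the model has learned nothing useful.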

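Problem: the combination of this feature and the random forest
The values of the POSIX time feature in the test set lie outside the range seen in the training set: every test point has a later timestamp than every training point. Trees, and therefore random forests, cannot extrapolate to feature ranges outside the training set. As a result, the model simply predicts the target value of the closest training point, which is the last time for which it observed any data.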
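
The effect is easy to reproduce on a toy problem. The sketch below assumes a synthetic, strictly increasing target (not the rental data); the names `X_future` and `pred` are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy training data (assumption): the target grows linearly with the feature.
X_train = np.arange(100).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()

forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(X_train, y_train)

# "Future" inputs lie entirely beyond the largest training value.
X_future = np.arange(100, 120).reshape(-1, 1)
pred = forest.predict(X_future)

# Every future point lands in the same leaves, so the prediction is flat
# and never exceeds the largest target seen during training.
print(pred[:5])
print(y_train.max())
```

A linear model would continue the trend here, while the forest stays at roughly the last value it saw.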

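Looking at the plot of rentals in the training set, two factors stand out as very important: the time of day and the day of the week.

(6) Random forest predictions using only the hour of the day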
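
One way to check such an observation numerically is to average the counts per hour and per weekday. The sketch below does this on synthetic counts (an assumption: busier during the day and on weekdays), so the printed numbers say nothing about the real data:

```python
import numpy as np
import pandas as pd

index = pd.date_range("2015-08-01", periods=248, freq=pd.Timedelta(hours=3))

# Synthetic counts (assumption): more rentals from 09:00 to 18:00,
# and half as many on Saturday and Sunday.
daytime = np.where((index.hour >= 9) & (index.hour <= 18), 30, 5)
weekday = np.where(index.dayofweek < 5, 1.0, 0.5)
series = pd.Series(daytime * weekday, index=index)

# Mean count per 3-hour slot and per day of week (0 = Monday).
by_hour = series.groupby(series.index.hour).mean()
by_day = series.groupby(series.index.dayofweek).mean()
print(by_hour)
print(by_day)
```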

X_hour = citibike.index.hour.values.reshape(-1, 1)
eval_on_features(X_hour, y, regressor)

(7) Random forest predictions using both the day of the week and the hour of the day

import numpy as np

# column 0: day of week (0 = Monday), column 1: hour of day
X_hour_week = np.hstack([citibike.index.dayofweek.values.reshape(-1, 1),
                         citibike.index.hour.values.reshape(-1, 1)])
eval_on_features(X_hour_week, y, regressor)

(8) Linear model predictions using the day of the week and the hour of the day

from sklearn.linear_model import LinearRegression
eval_on_features(X_hour_week, y, LinearRegression())

(9) Linear model predictions using a one-hot encoding of the two features

from sklearn.linear_model import Ridge
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
X_hour_week_onehot = enc.fit_transform(X_hour_week).toarray()

eval_on_features(X_hour_week_onehot, y, Ridge())

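This gives a much better fit than encoding the two features as continuous values, because the linear model can now learn a separate coefficient for each day of the week and each time of day.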
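
What the encoder produces can be seen on a tiny input. The three (day of week, hour) pairs below are illustrative and not taken from the data set, and the names `X_small` and `enc_small` are made up for this sketch:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Illustrative (day of week, hour) pairs.
X_small = np.array([[0, 0], [0, 12], [5, 12]])

enc_small = OneHotEncoder()
X_small_onehot = enc_small.fit_transform(X_small).toarray()

# One output column per distinct value of each input column.
print(enc_small.categories_)
print(X_small_onehot)
```

Each row has exactly two ones, one marking the day and one marking the hour, so a linear model gets one independent weight per category instead of a single slope per feature.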

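(10) Linear model predictions using products of the two features
Interaction features let the model learn one coefficient for each combination of day of the week and time of day.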

from sklearn.preprocessing import PolynomialFeatures

poly_transformer = PolynomialFeatures(degree=2, interaction_only=True,
                                      include_bias=False)
X_hour_week_onehot_poly = poly_transformer.fit_transform(X_hour_week_onehot)
lr = Ridge()
eval_on_features(X_hour_week_onehot_poly, y, lr)

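This yields a model that performs similarly to the random forest.
Advantage: it is clear what the model has learned (one coefficient for each combination of day of the week and time of day), which is not the case for the random forest.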

(11) Coefficients learned by the linear model using products of the two features

# column names in the same order as X_hour_week: days first, then hours
hour = ["%02d:00" % i for i in range(0, 24, 3)]
day = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
features = day + hour

# scikit-learn >= 1.0; older releases used get_feature_names instead
features_poly = poly_transformer.get_feature_names_out(features)
features_nonzero = np.array(features_poly)[lr.coef_ != 0]
coef_nonzero = lr.coef_[lr.coef_ != 0]

plt.figure(figsize=(15, 2))
plt.plot(coef_nonzero, 'o')
plt.xticks(np.arange(len(coef_nonzero)), features_nonzero, rotation=90)
plt.xlabel("Feature name")
plt.ylabel("Feature magnitude")