[Notes] [Machine Learning Basics] Interaction Features and Polynomial Features

Latest recommended article published on 2024-06-27 11:12:05

'VeNus

Article link: https://blog.csdn.net/qq_47809408/article/details/125227057

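One way to enrich a feature representation, especially for linear models, is to add interaction features and polynomial features derived from the original data.
(1) Linear regression using binned features and a single global slope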
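
The snippets in these notes use X, y, line, kb, X_binned and line_binned without defining them. The following is a minimal setup sketch so they can be run. It assumes a synthetic noisy wave dataset on [-3, 3] and ten equal-width bins; the data-generating formula, sample size and random seed are illustrative assumptions, not the dataset the notes were written against.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical stand-in data: a noisy wave over [-3, 3] (an assumption).
rng = np.random.RandomState(42)
X = rng.uniform(-3, 3, size=(120, 1))
y = np.sin(2 * X[:, 0]) + 0.3 * X[:, 0] + rng.normal(scale=0.3, size=120)

# Dense grid of inputs, used only for drawing the fitted curves.
line = np.linspace(-3, 3, 1000, endpoint=False).reshape(-1, 1)

# Ten equal-width bins, one-hot encoded as a dense array.
kb = KBinsDiscretizer(n_bins=10, strategy="uniform", encode="onehot-dense")
kb.fit(X)
X_binned = kb.transform(X)
line_binned = kb.transform(line)

print(X_binned.shape, line_binned.shape)
```

Each row of X_binned contains a single 1 marking the bin that sample falls into, and kb.bin_edges_[0] holds the eleven bin boundaries drawn by plt.vlines in the plots.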

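import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Append the one-hot bin indicators to the original feature
X_combined = np.hstack([X, X_binned])
print(X_combined.shape)

reg = LinearRegression().fit(X_combined, y)

# Predict on the dense grid, transformed the same way as X
line_combined = np.hstack([line, line_binned])
plt.plot(line, reg.predict(line_combined), label='linear regression combined')

# Mark the bin boundaries with faint vertical lines
plt.vlines(kb.bin_edges_[0], -3, 3, linewidth=1, alpha=.2)
plt.legend(loc="best")
plt.ylabel("Regression output")
plt.xlabel("Input feature")
plt.plot(X[:, 0], y, 'o', c='k')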

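In this example the model learns a separate offset for each bin plus one slope. Because there is only a single x-axis feature, that slope is shared by all bins.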

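(2) Adding interaction (product) features
An interaction or product feature encodes both which bin a data point falls into and where it lies on the x-axis, which lets each bin have its own slope.
1. Build the dataset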

X_product = np.hstack([X_binned, X * X_binned])
print(X_product.shape)
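
To make the shape of these product features concrete, here is a small sketch on three hand-picked points. The bin edges and sample values are assumptions chosen for illustration, and the one-hot encoding is built by hand with NumPy rather than with the discretizer used in the notes.

```python
import numpy as np

x = np.array([[-1.5], [0.2], [2.0]])
edges = np.array([-3.0, -1.0, 1.0, 3.0])       # hypothetical: three bins
bin_index = np.digitize(x[:, 0], edges[1:-1])  # bin id 0, 1 or 2 per sample
onehot = np.eye(3)[bin_index]                  # one indicator column per bin

# Broadcasting multiplies each row of the one-hot matrix by that row's x,
# so x appears only in the column of the bin the sample belongs to.
product = x * onehot
features = np.hstack([onehot, product])
print(features)
```

Each row ends up with two non-zero entries: the indicator of its bin and a copy of x in the matching product column. A linear model on these features therefore fits one offset and one slope per bin.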

2. Linear regression with a different slope in each bin

reg = LinearRegression().fit(X_product, y)

line_product = np.hstack([line_binned, line * line_binned])
plt.plot(line, reg.predict(line_product), label='linear regression product')

plt.vlines(kb.bin_edges_[0], -3, 3, linewidth=1, alpha=.2)

plt.plot(X[:, 0], y, 'o', c='k')
plt.ylabel("Regression output")
plt.xlabel("Input feature")
plt.legend(loc="best")
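
The per-bin slopes can be read directly from the fitted coefficients, since the coefficients of the product columns are the slopes. The sketch below checks this on noise-free piecewise-linear data with two bins; the data, the bin boundary at 1 and the true slopes (2 and -1) are assumptions made up for the check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Two bins on [0, 2): slope 2 in the first, slope -1 in the second.
x = np.linspace(0, 2, 40, endpoint=False).reshape(-1, 1)
in_second = (x[:, 0] >= 1).astype(int)
onehot = np.eye(2)[in_second]
y_piecewise = np.where(in_second == 0, 2 * x[:, 0], -x[:, 0] + 5)

# Same layout as X_product: bin indicators first, then x times the indicators.
features = np.hstack([onehot, x * onehot])
reg_demo = LinearRegression().fit(features, y_piecewise)

# The last two coefficients belong to the product columns, i.e. the slopes.
slopes = reg_demo.coef_[2:]
print(slopes)
```

The offsets are not uniquely determined here because the indicator columns sum to the intercept column, but the slopes are, which is why only they are inspected.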

(3) Using polynomials of the original features
There are two ways to expand a continuous feature:
1. Binning
2. Polynomials of the original feature

from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=10, include_bias=False)
poly.fit(X)
X_poly = poly.transform(X)

print("X_poly.shape: {}".format(X_poly.shape))

Compare the entries of X_poly with those of X:

print("Entries of X:\n{}".format(X[:5]))
print("Entries of X_poly:\n{}".format(X_poly[:5]))

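Calling get_feature_names_out returns the name of each feature, which shows its exponent (older scikit-learn versions exposed this as get_feature_names, which was removed in version 1.2):

print("Polynomial feature names:\n{}".format(poly.get_feature_names_out().tolist()))

Polynomial feature names:
['x0', 'x0^2', 'x0^3', 'x0^4', 'x0^5', 'x0^6', 'x0^7', 'x0^8', 'x0^9', 'x0^10']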
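
As a small check of what the transformer computes, the sketch below applies a degree-3 expansion to two hand-picked values (2 and 3, chosen only for illustration) so the powers can be verified by eye.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_small = np.array([[2.0], [3.0]])

# include_bias=False drops the constant column of ones.
poly_small = PolynomialFeatures(degree=3, include_bias=False)
X_small_poly = poly_small.fit_transform(X_small)

print(X_small_poly)       # columns are x, x^2, x^3
print(poly_small.powers_) # exponent of the input feature in each output column
```

The rows come out as 2, 4, 8 and 3, 9, 27, matching the exponents listed in powers_.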

Using the polynomial features together with a linear regression model gives polynomial regression:

reg = LinearRegression().fit(X_poly, y)

line_poly = poly.transform(line)
plt.plot(line, reg.predict(line_poly), label='polynomial linear regression')
plt.plot(X[:, 0], y, 'o', c='k')
plt.ylabel("Regression output")
plt.xlabel("Input feature")
plt.legend(loc="best")
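
The two steps (expand, then fit) can also be bundled into one estimator. The sketch below uses make_pipeline on a noise-free cubic, an assumed toy target, to show that a linear model on polynomial features recovers the polynomial's coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy target (an assumption for the demo): y = 1 + 2x - x^3
x = np.linspace(-2, 2, 50).reshape(-1, 1)
y_cubic = 1 + 2 * x[:, 0] - x[:, 0] ** 3

# The pipeline applies the polynomial expansion before the linear fit,
# so new inputs are transformed automatically when predicting.
model = make_pipeline(
    PolynomialFeatures(degree=3, include_bias=False),
    LinearRegression(),
)
model.fit(x, y_cubic)

linreg = model.named_steps["linearregression"]
print(linreg.intercept_, linreg.coef_)  # intercept, then weights of x, x^2, x^3
```

With the pipeline, model.predict(line) can be called on raw inputs without a separate poly.transform step.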

For comparison, fit a kernel SVM on the original data:

from sklearn.svm import SVR

for gamma in [1, 10]:
    svr = SVR(gamma=gamma).fit(X, y)
    plt.plot(line, svr.predict(line), label='SVR gamma={}'.format(gamma))

plt.plot(X[:, 0], y, 'o', c='k')
plt.ylabel("Regression output")
plt.xlabel("Input feature")
plt.legend(loc="best")

With a more complex model such as a kernel SVM, we get predictions similar to those of polynomial regression without any explicit feature transformation.

(4) How polynomial features are constructed and what they are good for (using the Boston Housing dataset)

Load the data and rescale it with MinMaxScaler:

# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older release
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(
    boston.data, boston.target, random_state=0)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
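
Note that the scaler is fitted on the training set only and then applied to both sets. A tiny sketch with made-up numbers shows the consequence: training values land in [0, 1], while test values outside the training range map outside that interval.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up one-feature example: training range is [0, 10].
train = np.array([[0.0], [10.0]])
test = np.array([[5.0], [20.0]])

scaler_demo = MinMaxScaler().fit(train)  # learns min=0, max=10 from train only
train_scaled = scaler_demo.transform(train)
test_scaled = scaler_demo.transform(test)

print(train_scaled.ravel())
print(test_scaled.ravel())
```

The test value 20 becomes 2.0, which is expected: the test set must be transformed with the training statistics, not rescaled on its own.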

Extract polynomial features and interaction features:

poly = PolynomialFeatures(degree=2).fit(X_train_scaled)
X_train_poly = poly.transform(X_train_scaled)
X_test_poly = poly.transform(X_test_scaled)
print("X_train.shape: {}".format(X_train.shape))
print("X_train_poly.shape: {}".format(X_train_poly.shape))

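The 13 original features are expanded to 105 features: a constant term, the 13 original features, the square of each original feature, and every possible product of two different original features.
degree=2 means that all features formed as products of up to two original features are included.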
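
The count of 105 can be checked by hand. The sketch below adds up the pieces for 13 inputs at degree 2, compares the total with the closed-form count of monomials of degree at most 2, and confirms it against the transformer on a dummy array (the zeros are placeholders; only the number of columns matters).

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features, degree = 13, 2

n_pairs = comb(n_features, 2)                        # products of two different features
n_expected = 1 + n_features + n_features + n_pairs   # bias + originals + squares + pairs
n_closed_form = comb(n_features + degree, degree)    # monomials of degree <= 2

dummy = np.zeros((1, n_features))
n_actual = PolynomialFeatures(degree=degree).fit(dummy).n_output_features_

print(n_pairs, n_expected, n_closed_form, n_actual)
```

All three totals agree, with 78 of the 105 columns being pairwise interaction terms.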

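get_feature_names_out gives the exact correspondence between input features and output features (older scikit-learn versions exposed this as get_feature_names, which was removed in version 1.2):

print("Polynomial feature names:\n{}".format(poly.get_feature_names_out().tolist()))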

Compare the performance of Ridge on the data with and without interaction features:

from sklearn.linear_model import Ridge
ridge = Ridge().fit(X_train_scaled, y_train)
print("Score without interactions: {:.3f}".format(
    ridge.score(X_test_scaled, y_test)))
ridge = Ridge().fit(X_train_poly, y_train)
print("Score with interactions: {:.3f}".format(
    ridge.score(X_test_poly, y_test)))

Score without interactions: 0.621
Score with interactions: 0.753

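With Ridge, the interaction and polynomial features give a large boost in performance. With a more complex model such as a random forest, the picture is different: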

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100).fit(X_train_scaled, y_train)
print("Score without interactions: {:.3f}".format(
    rf.score(X_test_scaled, y_test)))
rf = RandomForestRegressor(n_estimators=100).fit(X_train_poly, y_train)
print("Score with interactions: {:.3f}".format(rf.score(X_test_poly, y_test)))
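
Because the Boston dataset loader is no longer available in recent scikit-learn releases, the same comparison can be tried on other data. The sketch below swaps in the synthetic make_friedman1 regression problem, which is an assumption of this example and not the dataset used above, so its scores will differ from the ones reported in these notes. The sample size, noise level and random seeds are likewise arbitrary choices.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# Synthetic stand-in data (not the Boston Housing dataset).
X_syn, y_syn = make_friedman1(n_samples=500, n_features=10, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=0)

# Scale, then expand to degree-2 polynomial and interaction features.
scaler_syn = MinMaxScaler().fit(X_tr)
X_tr_s, X_te_s = scaler_syn.transform(X_tr), scaler_syn.transform(X_te)
poly_syn = PolynomialFeatures(degree=2).fit(X_tr_s)
X_tr_p, X_te_p = poly_syn.transform(X_tr_s), poly_syn.transform(X_te_s)

models = {
    "ridge": lambda: Ridge(),
    "forest": lambda: RandomForestRegressor(n_estimators=100, random_state=0),
}

# Test-set R^2 for each model, without and with the expanded features.
scores = {}
for name, make_model in models.items():
    scores[name, "plain"] = make_model().fit(X_tr_s, y_tr).score(X_te_s, y_te)
    scores[name, "poly"] = make_model().fit(X_tr_p, y_tr).score(X_te_p, y_te)

for key, value in scores.items():
    print(key, round(value, 3))
```

Fixing random_state on the forest makes the comparison repeatable; without it, the random forest scores vary slightly from run to run.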