Episode 42, Python Machine Learning: Interaction Features and Polynomial Features

Latest recommended article published on 2024-06-27 11:12:05

weixin_39690972


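Another way to obtain a richer feature representation, particularly for linear models, is to add interaction features and polynomial features derived from the original data. This kind of feature engineering is common in statistical modeling and is also widely used in practical machine learning.

Linear models can learn not only offsets but also slopes. One way to add a slope to a linear model fitted on binned data is to add the original feature back in, as the following code shows: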

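import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from mglearn.datasets import make_wave

# make_wave returns a one-feature regression dataset
x, y = make_wave(n_samples=150)
# 1000 evenly spaced points on [-3, 3), used to plot model predictions
line = np.linspace(-3, 3, 1000, endpoint=False).reshape(-1, 1)
# Baseline: a decision tree fitted on the raw feature
reg = DecisionTreeRegressor(min_samples_split=3).fit(x, y)
# Plot the tree's predictions so that the legend below has a labeled entry
plt.plot(line, reg.predict(line), label="Decision tree")
plt.plot(x[:, 0], y, 'o', c='k')
plt.ylabel("Regression output")
plt.xlabel("Input feature")
plt.legend(loc="best")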

# Create 10 equal-width bins between -3 and 3 (11 bin edges)
bins = np.linspace(-3, 3, 11)
print("bins: {}".format(bins))
# Record which bin each data point falls into, computed with np.digitize
which_bin = np.digitize(x, bins=bins)
print("\nData points:\n", x[:5])
print("\nBin membership for data points:\n", which_bin[:5])

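from sklearn.preprocessing import OneHotEncoder
# Transform the bin indices with OneHotEncoder.
# scikit-learn >= 1.2 uses sparse_output=False; older versions use sparse=False.
encoder = OneHotEncoder(sparse_output=False)
# encoder.fit finds the unique values that appear in which_bin
encoder.fit(which_bin)
# transform creates the one-hot encoding
x_binned = encoder.transform(which_bin)
print(x_binned[:5])
print("x_binned.shape: {}".format(x_binned.shape))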
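To make the binning and encoding step more concrete, here is a minimal NumPy-only sketch of what `np.digitize` followed by one-hot encoding produces. The four input values are hypothetical and chosen only for illustration, and indexing into an identity matrix is a stand-in for the encoder, not how `OneHotEncoder` is implemented.

```python
import numpy as np

# Hypothetical toy inputs (not the make_wave data used in the post)
x_toy = np.array([[-2.9], [-0.1], [0.1], [2.9]])
bins = np.linspace(-3, 3, 11)              # 11 edges define 10 bins

# np.digitize returns i such that bins[i-1] <= value < bins[i]
which_bin = np.digitize(x_toy, bins=bins)  # same shape as x_toy, values in 1..10

# One-hot encode by picking rows of an identity matrix (bin k -> column k-1)
one_hot = np.eye(len(bins) - 1)[which_bin[:, 0] - 1]

print(which_bin.ravel())
print(one_hot)
```

Each row of `one_hot` contains a single 1, in the column of the bin that the sample falls into. A linear model fitted on these columns can therefore only learn one constant per bin.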

# Next, build a new linear model and a new decision tree on the one-hot encoded data.
line_binned = encoder.transform(np.digitize(line, bins=bins))
reg = LinearRegression().fit(x_binned, y)
plt.plot(line, reg.predict(line_binned), label='Linear regression binned')
reg = DecisionTreeRegressor(min_samples_split=3).fit(x_binned, y)
plt.plot(line, reg.predict(line_binned), label='Decision tree binned')
plt.plot(x[:, 0], y, 'o', c='k')
# Faint vertical lines at the bin edges
plt.vlines(bins, -3, 3, linewidth=1, alpha=.2)
plt.ylabel("Regression output")
plt.xlabel("Input feature")

# Add the original feature back in alongside the one-hot bin indicators
x_combined = np.hstack([x, x_binned])
print(x_combined.shape)
reg = LinearRegression().fit(x_combined, y)
line_combined = np.hstack([line, line_binned])
plt.plot(line, reg.predict(line_combined), label='Linear regression combined')
# Mark the bin boundaries with dotted vertical lines
for edge in bins:
    plt.plot([edge, edge], [-3, 3], ':', c='k')
plt.legend(loc="best")
plt.ylabel("Regression output")
plt.xlabel("Input feature")
plt.plot(x[:, 0], y, 'o', c='k')

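Part of the output from running the code above is shown in the figure:

Linear regression using binned features and a single global slope

In this example, the model learned an offset for each bin together with a slope. The learned slope is downward and is the same in every bin: there is only one x feature, so there is only one slope.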
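The "one offset per bin, one shared slope" structure can be checked directly on the fitted coefficients. The sketch below is a rough NumPy-only illustration: it uses synthetic data as a stand-in for the wave dataset (an assumption), solves the least-squares problem with `np.linalg.lstsq` instead of `LinearRegression`, and defines a hypothetical helper `predict`.

```python
import numpy as np

rng = np.random.RandomState(0)
# Synthetic stand-in for the wave data (an assumption, not make_wave itself)
x = np.linspace(-2.95, 2.95, 200).reshape(-1, 1)
y = np.sin(4 * x[:, 0]) + x[:, 0] + rng.normal(scale=0.3, size=200)

bins = np.linspace(-3, 3, 11)
one_hot = np.eye(10)[np.digitize(x[:, 0], bins) - 1]

# Design matrix: [original feature | 10 bin indicators]
x_combined = np.hstack([x, one_hot])
coef, *_ = np.linalg.lstsq(x_combined, y, rcond=None)
slope, offsets = coef[0], coef[1:]

def predict(x_new):
    """Hypothetical helper: apply the same binning and the fitted coefficients."""
    oh = np.eye(10)[np.digitize(x_new[:, 0], bins) - 1]
    return np.hstack([x_new, oh]) @ coef

print("shared slope:", slope)
print("per-bin offsets:", offsets)
```

There is exactly one coefficient for the original feature, so moving a point by the same distance inside any bin changes the prediction by the same amount. Only the offsets differ between bins.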

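Because the slope is shared across all bins, it does not seem very useful. We would prefer each bin to have its own slope. To achieve this, we can add an interaction (or product) feature that indicates which bin a data point is in and where it lies on the x-axis. This feature is the product of the bin indicator and the original feature. Let's create this dataset and visualize the result: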

# Bin indicators plus the product of each indicator with the original feature
x_product = np.hstack([x_binned, x * x_binned])
print(x_product.shape)
reg = LinearRegression().fit(x_product, y)
line_product = np.hstack([line_binned, line * line_binned])
plt.plot(line, reg.predict(line_product), label='linear regression product')
# Mark the bin boundaries with dotted vertical lines
for edge in bins:
    plt.plot([edge, edge], [-3, 3], ':', c='k')
plt.plot(x[:, 0], y, 'o', c='k')
plt.ylabel("Regression output")
plt.xlabel("Input feature")
plt.legend(loc="best")

The result is as follows:

Linear regression with a different slope in each bin
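One way to read the product features is that the model now fits an independent straight line inside each bin. The following sketch checks that interpretation on synthetic data (an assumption standing in for the wave dataset), again using `np.linalg.lstsq` in place of `LinearRegression`. The bin index `k` is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
# Synthetic stand-in for the wave data (an assumption, not make_wave itself)
x = np.linspace(-2.95, 2.95, 200).reshape(-1, 1)
y = np.sin(4 * x[:, 0]) + x[:, 0] + rng.normal(scale=0.3, size=200)

bins = np.linspace(-3, 3, 11)
bin_idx = np.digitize(x[:, 0], bins) - 1        # bin index 0..9 per sample
one_hot = np.eye(10)[bin_idx]

# Design matrix: [10 bin indicators | 10 products of indicator and x]
x_product = np.hstack([one_hot, x * one_hot])
coef, *_ = np.linalg.lstsq(x_product, y, rcond=None)
offsets, slopes = coef[:10], coef[10:]

# Compare with a separate straight-line fit restricted to one bin
k = 3
mask = bin_idx == k
slope_k, offset_k = np.polyfit(x[mask, 0], y[mask], deg=1)

print("joint model, bin", k, ":", slopes[k], offsets[k])
print("separate fit, bin", k, ":", slope_k, offset_k)
```

Because each sample has non-zero entries only in the two columns belonging to its own bin, the joint least-squares problem decouples into one small line fit per bin, which is why the two printed pairs agree up to numerical precision.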

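As the result shows, the slope is no longer the same in every bin. Binning is one way to expand a continuous feature. Another is to use polynomials of the original feature: for a given feature x, we can consider x ** 2, x ** 3, and so on. This is implemented in PolynomialFeatures in the preprocessing module.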

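from sklearn.preprocessing import PolynomialFeatures
# Include polynomials up to x ** 10.
# The default include_bias=True would add a constant feature equal to 1; it is turned off here.
poly = PolynomialFeatures(degree=10, include_bias=False)
poly.fit(x)
x_poly = poly.transform(x)
# The degree is 10, so 10 features are generated:
print("x_poly.shape: {}".format(x_poly.shape))
# Compare the entries of x and x_poly
print("Entries of x:\n{}".format(x[:5]))
print("Entries of x_poly:\n{}".format(x_poly[:5]))
# get_feature_names_out returns the feature names, which give the exponent of each feature
# (scikit-learn >= 1.0; older versions provide get_feature_names instead)
print("Poly feature names:\n{}".format(poly.get_feature_names_out()))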

Part of the output is as follows:

Partial output for the 10 polynomial features
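For a single input feature, the degree-10 expansion without a bias column amounts to the columns x, x ** 2, ..., x ** 10. The sketch below builds them by hand with NumPy broadcasting. The sample values are hypothetical, and this is an illustration of the resulting matrix rather than the library's implementation.

```python
import numpy as np

# Hypothetical sample values, chosen to show how the magnitudes spread out
x_toy = np.array([[0.5], [1.0], [-2.0], [3.0]])
degree = 10

# Column j holds x ** (j + 1): powers 1..10, no constant column
exponents = np.arange(1, degree + 1)
x_poly = x_toy ** exponents        # broadcasts (4, 1) against (10,)

print(x_poly.shape)
print(x_poly[:, -1])               # the x ** 10 column
```

Note how differently the last column scales: a sample at 0.5 contributes roughly 0.001, while a sample at 3.0 contributes 59049. Features with such a wide range are one reason why high-degree polynomial fits can react strongly near the edges of the data.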

Using polynomial features together with a linear regression model gives the classical polynomial regression model:

# Fit a linear model on the polynomial features
reg = LinearRegression().fit(x_poly, y)
# Apply the same polynomial expansion to the plotting grid
line_poly = poly.transform(line)
plt.plot(line, reg.predict(line_poly), label='polynomial linear regression')
plt.plot(x[:, 0], y, 'o', c='k')
plt.xlabel("Input value")
plt.ylabel("output regression value")
plt.legend(loc="best")

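The result is:

Linear regression with degree-10 polynomial features

As the result shows, polynomial features produce a very smooth fit on this one-dimensional dataset. However, high-degree polynomials can behave in extreme ways near the boundaries or in regions with little data. For comparison, we next apply a kernel SVM to the original data without any transformation.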

from sklearn.svm import SVR
# Fit an RBF-kernel SVR on the raw feature for two values of gamma
for gamma in [1, 10]:
    svr = SVR(gamma=gamma).fit(x, y)
    plt.plot(line, svr.predict(line), label='SVR gamma={}'.format(gamma))
plt.plot(x[:, 0], y, 'o', c='k')
plt.xlabel("input feature value")
plt.ylabel("output regression value")
plt.legend(loc="best")

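The result is shown in the figure:

Comparison of different gamma values for an SVM with an RBF kernel

As the result shows, with a more complex model (a kernel SVM) we can learn predictions of similar complexity to polynomial regression, without performing an explicit feature transformation.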
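To get a feel for what gamma controls, it can help to compute the RBF kernel by hand. The sketch below uses the standard definition k(a, b) = exp(-gamma * ||a - b||^2). The helper `rbf_kernel` and the three sample points are hypothetical, and the sketch does not reproduce the SVR optimization itself.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """Return the matrix of exp(-gamma * ||a_i - b_j||^2) for rows of a and b."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dist)

# Hypothetical one-dimensional points at distances 0.5 and 2.0 from the first
points = np.array([[0.0], [0.5], [2.0]])

for gamma in [1, 10]:
    K = rbf_kernel(points, points, gamma)
    print("gamma =", gamma)
    print(np.round(K, 4))
```

With the larger gamma, the similarity between the points at 0.0 and 0.5 drops sharply, so each training point influences only a narrow neighborhood. That matches the more wiggly curve one would expect for gamma=10 compared with gamma=1.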

Next time we will use the Boston housing dataset for a simple illustration and comparison.
