In [92]: import numpy as np
    ...: import matplotlib.pyplot as plt
In [93]: x = np.random.uniform(-3, 3, size=100)
    ...: X = x.reshape(-1, 1)
    ...: y = 0.5 * x**2 + x + 2 + np.random.normal(0, 1, size=100)
Generating polynomial features with scikit-learn
In [94]: from sklearn.preprocessing import PolynomialFeatures
In [95]: # degree sets the highest power of the original features to add to the dataset
    ...: poly = PolynomialFeatures(degree=2)
    ...: poly.fit(X)
    ...: X2 = poly.transform(X)
In [97]: X[:5,:]
Out[97]:
array([[-2.19518176],
       [ 2.77476822],
       [ 0.48917013],
       [ 0.39238714],
       [-0.23027698]])
The expanded features after the transform (columns are 1, x, x^2):
In [98]: X2[:5, :]
Out[98]:
array([[ 1.        , -2.19518176,  4.81882296],
       [ 1.        ,  2.77476822,  7.69933869],
       [ 1.        ,  0.48917013,  0.23928742],
       [ 1.        ,  0.39238714,  0.15396766],
       [ 1.        , -0.23027698,  0.05302749]])
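To make the transform less of a black box, here is a minimal sketch that builds the same three columns by hand and compares them with the output of PolynomialFeatures. The seeded generator and the name X2_manual are my own additions for illustration, not part of the session above.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Seeded generator used only so the sketch is reproducible (an assumption,
# the original session uses the unseeded np.random API).
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=100)
X = x.reshape(-1, 1)

# Library-generated degree-2 features: columns are [1, x, x^2]
X2 = PolynomialFeatures(degree=2).fit_transform(X)

# The same matrix assembled by hand
X2_manual = np.hstack([np.ones_like(X), X, X ** 2])

same = np.allclose(X2, X2_manual)
print(X2.shape, same)
```

For a single input feature, then, degree=2 amounts to prepending a column of ones and appending a column of squares.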
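Train and predict with linear regression. The model is still linear in the expanded features, but the fitted curve is nonlinear in x.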
In [99]: from sklearn.linear_model import LinearRegression
    ...: lin_reg2 = LinearRegression()
    ...: lin_reg2.fit(X2, y)
    ...: y_predict2 = lin_reg2.predict(X2)
In [100]: plt.scatter(x, y)
     ...: plt.plot(np.sort(x), y_predict2[np.argsort(x)], color='r')
In [101]: lin_reg2.coef_
Out[101]: array([0.,0.99420228,0.55194625])
In [102]: lin_reg2.intercept_
Out[102]: 1.8569459632147785
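The coefficients line up with the generating formula y = 0.5*x^2 + x + 2: the weight on x is close to 1, the weight on x^2 is close to 0.5, and the intercept is close to 2. The weight on the constant column is 0 because LinearRegression fits its own intercept. The sketch below reconstructs the predictions from coef_ and intercept_ by hand. The seed and the names model, manual and matches are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)  # seed is an arbitrary choice for reproducibility
x = rng.uniform(-3, 3, size=100)
y = 0.5 * x**2 + x + 2 + rng.normal(0, 1, size=100)

# Expand to [1, x, x^2] and fit an ordinary linear regression on top
X2 = PolynomialFeatures(degree=2).fit_transform(x.reshape(-1, 1))
model = LinearRegression().fit(X2, y)

# Prediction is just intercept + weighted sum of the expanded columns
manual = model.intercept_ + X2 @ model.coef_
matches = np.allclose(manual, model.predict(X2))

print(model.coef_, model.intercept_, matches)
```

The exact numbers depend on the random noise, so they will differ slightly from the session output above.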
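A dataset with two features
With two features x1 and x2, degree=2 generates three second-order columns: x1^2, x1*x2, and x2^2. Together with the constant column and the original x1 and x2, that gives 6 columns in total.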
In [103]: X = np.arange(1,11).reshape(-1,2)
In [104]: X.shape
Out[104]: (5, 2)
In [105]: X
Out[105]:
array([[ 1,  2],
       [ 3,  4],
       [ 5,  6],
       [ 7,  8],
       [ 9, 10]])
In [106]: poly = PolynomialFeatures(degree=2)
     ...: poly.fit(X)
     ...: X2 = poly.transform(X)
In [107]: X2.shape
Out[107]: (5, 6)
In [108]: X2
Out[108]:
array([[  1.,   1.,   2.,   1.,   2.,   4.],
       [  1.,   3.,   4.,   9.,  12.,  16.],
       [  1.,   5.,   6.,  25.,  30.,  36.],
       [  1.,   7.,   8.,  49.,  56.,  64.],
       [  1.,   9.,  10.,  81.,  90., 100.]])
With degree=3, the same two features expand to 10 columns.
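The counts 6 and 10 can be checked without scikit-learn by enumerating every monomial of total degree 0 up to the chosen degree. The helper count_poly_features below is a hypothetical name of my own. It only counts terms and does not reproduce the library's implementation.

```python
from itertools import combinations_with_replacement
from math import comb


def count_poly_features(n_features, degree):
    """Count monomials of total degree 0..degree in n_features variables."""
    return sum(
        1
        for d in range(degree + 1)
        # each multiset of d feature indices corresponds to one monomial
        for _ in combinations_with_replacement(range(n_features), d)
    )


two_features_deg2 = count_poly_features(2, 2)
two_features_deg3 = count_poly_features(2, 3)
print(two_features_deg2, two_features_deg3)

# Closed form for the same count: C(n_features + degree, degree)
print(comb(2 + 2, 2), comb(2 + 3, 3))
```

The count grows quickly with both the number of features and the degree, which is worth keeping in mind before choosing a high degree on a wide dataset.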
Pipeline
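Data passed to the pipeline flows through the three steps defined in it, in order: polynomial feature expansion, standardization, then linear regression.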
In [115]: from sklearn.preprocessing import StandardScaler
     ...: from sklearn.pipeline import Pipeline
     ...: poly_reg = Pipeline([
     ...:     ("poly", PolynomialFeatures(degree=2)),
     ...:     ("std_scaler", StandardScaler()),
     ...:     ("lin_reg", LinearRegression())
     ...: ])
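In [116]: # X must be the original (100, 1) matrix x.reshape(-1, 1) here, not the (5, 2) example above
     ...: poly_reg.fit(X, y)
     ...: y_predict = poly_reg.predict(X)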
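As a sanity check on what the pipeline does, the sketch below fits the same three steps by hand and compares predictions on a few new points. The seed, the test points in X_new, and the names poly, scaler and lin are assumptions added for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
x = rng.uniform(-3, 3, size=100)
X = x.reshape(-1, 1)
y = 0.5 * x**2 + x + 2 + rng.normal(0, 1, size=100)

# The pipeline from the text
poly_reg = Pipeline([
    ("poly", PolynomialFeatures(degree=2)),
    ("std_scaler", StandardScaler()),
    ("lin_reg", LinearRegression()),
])
poly_reg.fit(X, y)

# The same three steps applied manually, one after another
poly = PolynomialFeatures(degree=2)
scaler = StandardScaler()
lin = LinearRegression()
X_scaled = scaler.fit_transform(poly.fit_transform(X))
lin.fit(X_scaled, y)

# New inputs must go through transform (not fit_transform) at each step
X_new = np.array([[-1.0], [0.0], [2.0]])
pipe_pred = poly_reg.predict(X_new)
manual_pred = lin.predict(scaler.transform(poly.transform(X_new)))

agree = np.allclose(pipe_pred, manual_pred)
print(pipe_pred, agree)
```

The practical benefit of the pipeline is that predict on new data reuses the fitted transformers automatically, so no step can be forgotten or accidentally refit.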