Part 15: Machine Learning with scikit-learn
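1. Introduction
- Since the project began in 2007, scikit-learn has become one of the most important machine learning libraries for Python. scikit-learn, often abbreviated to sklearn, covers four major families of machine learning algorithms: classification, regression, dimensionality reduction, and clustering. It also provides three supporting modules for feature extraction, data preprocessing, and model evaluation.
- sklearn started as an extension of SciPy and is built on NumPy, SciPy, and matplotlib. Drawing on the strengths of these libraries greatly improves the efficiency of machine learning work.
- sklearn has thorough documentation, is easy to pick up, offers a rich API, and is popular in academia. It wraps a large number of machine learning algorithms, including the LIBSVM and LIBLINEAR libraries. It also ships with a number of built-in datasets, which saves the time otherwise spent obtaining and cleaning data.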
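As a rough illustration of the built-in datasets and the uniform estimator interface mentioned above, the sketch below loads a bundled dataset and fits a linear model. The choice of `load_diabetes` and `LinearRegression` is only an assumption made for this example; other bundled datasets and estimators follow the same `fit` / `predict` / `score` pattern.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

# Load a dataset that ships with scikit-learn (no download needed).
X, y = load_diabetes(return_X_y=True)

# Estimators share the same fit / predict / score methods.
model = LinearRegression()
model.fit(X, y)
predictions = model.predict(X[:5])  # predictions for the first five samples
r2 = model.score(X, y)              # R^2 measured on the training data

print(X.shape, y.shape)
print(predictions.shape, round(r2, 3))
```

Swapping in a different regressor only changes the line that constructs `model`; the rest of the workflow stays the same.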
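2. Regression
- Regression analysis is a statistical method for determining the quantitative relationship between two or more interdependent variables, and it is very widely used. By the number of independent variables, it is divided into simple regression (one independent variable) and multiple regression (two or more). By the number of dependent variables, it is divided into univariate and multivariate regression. By the type of relationship between the independent and dependent variables, it is divided into linear and nonlinear regression. If the analysis involves one independent variable and one dependent variable, and their relationship can be approximated by a straight line, it is called simple linear regression. If it involves two or more independent variables and the dependent variable depends linearly on them, it is called multiple linear regression.
- Basic regressors: linear regression, decision tree, SVM, KNN
- Ensemble methods: random forest, AdaBoost, gradient boosting, bagging, extra trees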
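The difference between simple and multiple linear regression described above can be sketched with an ordinary least-squares fit in NumPy. The coefficients and sample sizes below are made up for illustration, and the data is noise-free so that the fit recovers them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simple linear regression: one independent variable, y = 2x + 1 (made-up coefficients).
x = np.linspace(0, 10, 50)
y_simple = 2.0 * x + 1.0
A_simple = np.column_stack([x, np.ones_like(x)])  # columns: x, intercept
coef_simple, *_ = np.linalg.lstsq(A_simple, y_simple, rcond=None)

# Multiple linear regression: two independent variables, y = 1.5*x1 - 0.5*x2 + 3.
X = rng.uniform(-5, 5, size=(50, 2))
y_multi = 1.5 * X[:, 0] - 0.5 * X[:, 1] + 3.0
A_multi = np.column_stack([X, np.ones(len(X))])   # columns: x1, x2, intercept
coef_multi, *_ = np.linalg.lstsq(A_multi, y_multi, rcond=None)

print(np.round(coef_simple, 3))  # slope, intercept
print(np.round(coef_multi, 3))   # weight of x1, weight of x2, intercept
```

In both cases the model is linear in its coefficients; only the number of input columns changes.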
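For the experiments, we simulate a function of two variables, $y = 0.5\sin(x_1) + 0.5\cos(x_2) + 0.1x_1 + 3$, where $x_1$ ranges over [0, 50] and $x_2$ over [-10, 10]. The training set contains 500 $(x_1, x_2)$ points and the test set contains 100. Uniform noise between -0.5 and 0.5 is added to the training targets. The code that generates the data is as follows: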
########### 1. Data-generating function ##########
import numpy as np

def f(x1, x2):
    y = 0.5 * np.sin(x1) + 0.5 * np.cos(x2) + 0.1 * x1 + 3
    return y

# Build the training and test sets
def load_data():
    x1_train = np.linspace(0, 50, 500)
    x2_train = np.linspace(-10, 10, 500)
    # Add uniform noise in [-0.5, 0.5) to each training target.
    # np.random.random() returns a scalar, so every row holds three plain floats.
    data_train = np.array([[x1, x2, f(x1, x2) + (np.random.random() - 0.5)] for x1, x2 in zip(x1_train, x2_train)])
    x1_test = np.linspace(0, 50, 100) + 0.5 * np.random.random(100)  # jitter the test inputs
    x2_test = np.linspace(-10, 10, 100) + 0.02 * np.random.random(100)
    data_test = np.array([[x1, x2, f(x1, x2)] for x1, x2 in zip(x1_test, x2_test)])  # test targets are noise-free
    return data_train, data_test
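%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.rcParams['figure.figsize'] = (12.0, 8.0)
train, test = load_data()
# Plot the targets against the sample index; the 100 test points are
# placed at every fifth index so they span the same axis as the 500 training points.
x1 = range(0, 500)
x2 = range(0, 500, 5)
plt.plot(x1, train[:, 2], 'r-o')  # noisy training targets (red)
plt.plot(x2, test[:, 2], 'b-o');  # noise-free test targets (blue)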
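Because the noise above is drawn without a fixed seed, each run produces slightly different data. A minimal sketch of a reproducible, vectorized variant is shown below; the name `load_data_seeded` and its `seed` parameter are hypothetical and not part of the original code.

```python
import numpy as np

def f(x1, x2):
    return 0.5 * np.sin(x1) + 0.5 * np.cos(x2) + 0.1 * x1 + 3

def load_data_seeded(seed=0):
    """Hypothetical seeded, vectorized counterpart of load_data()."""
    rng = np.random.default_rng(seed)

    # Training set: 500 points, uniform noise in [-0.5, 0.5) on the targets.
    x1_train = np.linspace(0, 50, 500)
    x2_train = np.linspace(-10, 10, 500)
    y_train = f(x1_train, x2_train) + rng.uniform(-0.5, 0.5, size=500)
    data_train = np.column_stack([x1_train, x2_train, y_train])

    # Test set: 100 points with jittered inputs and noise-free targets.
    x1_test = np.linspace(0, 50, 100) + 0.5 * rng.random(100)
    x2_test = np.linspace(-10, 10, 100) + 0.02 * rng.random(100)
    data_test = np.column_stack([x1_test, x2_test, f(x1_test, x2_test)])
    return data_train, data_test

train_s, test_s = load_data_seeded(seed=42)
print(train_s.shape, test_s.shape)
```

Calling `load_data_seeded` twice with the same seed returns identical arrays, which makes model scores comparable across runs.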
2.2 Regression analysis
########### 1. Data generation ##########
train, test = load_data()
x_train, y_train = train[:, :2], train[:, 2]  # first two columns are x1 and x2, third is y; this y includes random noise
x_test, y_test = test[:, :2], test[:, 2]  # same layout, but this y has no noise
########### 2. Regression ##########
def try_different_method(model):
    model.fit(x_train, y_train)  # fit the model on the given features and targets
    score = model.score(x_test, y_test)  # R^2 (coefficient of determination) on the test set, used as the evaluation metric
    result = model.predict(x_test)  # predicted values
    plt.figure()
    plt.plot(np.arange(len(result)), y_test, 'go-', label='true value')
    plt.plot(np.arange(len(result)), result, 'ro-', label='predict value')
    plt.title('score: %f' % score)
    plt.legend()
    plt.show()
########### 3. Choosing a method ##########
#### 3.1 Decision tree regression ####
from sklearn import tree
model_DecisionTreeRegressor = tree.DecisionTreeRegressor()
#### 3.2 Linear regression ####
from sklearn import linear_model
model_LinearRegression = linear_model.LinearRegression()
#### 3.3 SVM regression ####
from sklearn import svm
model_SVR = svm.SVR()
#### 3.4 KNN regression ####
from sklearn import neighbors
model_KNeighborsRegressor = neighbors.KNeighborsRegressor()
#### 3.5 Random forest regression ####
from sklearn import ensemble
model_RandomForestRegressor = ensemble.RandomForestRegressor(n_estimators=20)  # uses 20 decision trees
#### 3.6 AdaBoost regression ####
from sklearn import ensemble
model_AdaBoostRegressor = ensemble.AdaBoostRegressor(n_estimators=50)  # uses up to 50 decision trees
#### 3.7 GBRT regression ####
from sklearn import ensemble
model_GradientBoostingRegressor = ensemble.GradientBoostingRegressor(n_estimators=100)  # uses 100 decision trees
#### 3.8 Bagging regression ####
from sklearn.ensemble import BaggingRegressor
model_BaggingRegressor = BaggingRegressor()
#### 3.9 ExtraTree (extremely randomized tree) regression ####
# Note: sklearn.tree.ExtraTreeRegressor is a single randomized tree; the ensemble version is sklearn.ensemble.ExtraTreesRegressor
from sklearn.tree import ExtraTreeRegressor
model_ExtraTreeRegressor = ExtraTreeRegressor()
try_different_method(model_DecisionTreeRegressor)
#try_different_method(model_LinearRegression)
#try_different_method(model_SVR)
#try_different_method(model_KNeighborsRegressor)
#try_different_method(model_RandomForestRegressor)
#try_different_method(model_AdaBoostRegressor)
#try_different_method(model_GradientBoostingRegressor)
# try_different_method(model_BaggingRegressor)
#try_different_method(model_ExtraTreeRegressor)
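Instead of uncommenting one call at a time, several models can be compared in a loop. The sketch below is a self-contained variant under a few assumptions: it rebuilds the synthetic data with an arbitrary seed, leaves the test inputs unjittered, and uses only three of the models listed above. It also recomputes the score by hand to show that `score` on a regressor is the coefficient of determination R².

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

def f(x1, x2):
    return 0.5 * np.sin(x1) + 0.5 * np.cos(x2) + 0.1 * x1 + 3

# Rebuild the synthetic data (seed chosen arbitrarily for reproducibility).
rng = np.random.default_rng(0)
X_tr = np.column_stack([np.linspace(0, 50, 500), np.linspace(-10, 10, 500)])
y_tr = f(X_tr[:, 0], X_tr[:, 1]) + rng.uniform(-0.5, 0.5, size=500)
X_te = np.column_stack([np.linspace(0, 50, 100), np.linspace(-10, 10, 100)])
y_te = f(X_te[:, 0], X_te[:, 1])

models = {
    "linear": LinearRegression(),
    "knn": KNeighborsRegressor(),
    "tree": DecisionTreeRegressor(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = np.sum((y_te - pred) ** 2)
    ss_tot = np.sum((y_te - y_te.mean()) ** 2)
    manual_r2 = 1.0 - ss_res / ss_tot
    results[name] = (model.score(X_te, y_te), manual_r2)
    print(f"{name}: score={results[name][0]:.4f}, manual R^2={manual_r2:.4f}")
```

A score of 1 means perfect prediction, and a score of 0 is what a model that always predicts the mean of the test targets would obtain.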
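2.3 Case study: regression analysis of Boston house prices
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this import needs an older release
from sklearn.datasets import load_boston  # loader for the Boston house-price dataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
# 1 Prepare the data
# Load the Boston-area house-price data
boston = load_boston()
print(boston.DESCR)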
.. _boston_dataset:
Boston house prices dataset
Data Set Characteristics:
    :Number of Instances: 506    # number of samples
    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
    :Attribute Information (in order):    # factors that may influence house prices
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's
    :Missing Attribute Values: None
    :Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.
The Boston house-price data has been used in many machine learning papers that address regression problems.
.. topic:: References
   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
# Data exploration
x = boston.data  # features, as a NumPy array
y = boston.target  # target (MEDV, in $1000's)
# Look at the spread of the target values
print("Max price:", np.max(boston.target))  # 50
print("Min price:", np.min(boston.target))  # 5
print("Mean price:", np.mean(boston.target))  # 22.532806324110677
Max price: 50.0
Min price: 5.0
Mean price: 22.532806324110677
# 2 Split into training and test data
# Randomly sample 25% for testing and 75% for training
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=33)  # random_state: random seed
print(x_train.shape)
print(y_train.shape)
print(x_train[0,:</