9. Machine Learning with scikit-learn

Latest recommended article published on 2021-09-06 15:32:50

周纠纠

Article link: https://blog.csdn.net/qq_40947195/article/details/105256104

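This article introduces the scikit-learn machine learning library, including its main features and commonly used algorithms. It covers regression in detail, using the Boston housing price case study, and discusses evaluation metrics such as mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), and R squared. It then turns to classification, explaining basic concepts such as K-fold cross-validation and evaluation metrics. Finally, it briefly covers clustering, in particular the principles and applications of K-means and hierarchical clustering.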

Part 15: Machine Learning with scikit-learn

1. Introduction

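Since the project began in 2007, scikit-learn has become one of the most important machine learning libraries for Python. scikit-learn, usually abbreviated sklearn, supports four major families of machine learning algorithms: classification, regression, dimensionality reduction, and clustering. It also provides three supporting modules: feature extraction, data preprocessing, and model evaluation.
sklearn is an extension of SciPy (a SciPy toolkit, or "scikit") built on top of NumPy, SciPy, and matplotlib. Building on these libraries makes machine learning work considerably more efficient.
sklearn has thorough documentation, is easy to pick up, and offers a rich API, which has made it popular in academia. It wraps a large number of machine learning algorithms, including LIBSVM and LIBLINEAR. It also ships with many built-in datasets, which saves the time otherwise spent obtaining and cleaning data.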
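
As a small illustration of the built-in datasets and the uniform estimator API, the sketch below loads a dataset that ships with the library and fits a model on it. The choice of `load_iris` and `KNeighborsClassifier` is an assumption made only for illustration; the text above does not name a specific dataset or model at this point.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Load a dataset bundled with scikit-learn (no download needed)
iris = load_iris()
X, y = iris.data, iris.target
print(X.shape, y.shape)           # feature matrix and label vector

# Estimators share the same fit / predict / score pattern
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X, y)
train_accuracy = clf.score(X, y)  # accuracy on the data the model was fitted on
print(train_accuracy)
```

Scoring on the training data only demonstrates the API; a held-out test set is needed for an honest estimate.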

2. Regression

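Regression analysis is a widely used statistical method for quantifying how two or more variables depend on one another. It can be classified in several ways: by the number of independent variables, into simple (one-variable) and multiple regression; by the number of dependent variables, into univariate and multivariate regression; and by the form of the relationship between the independent and dependent variables, into linear and nonlinear regression. If the analysis involves one independent variable and one dependent variable, and their relationship can be approximated by a straight line, it is called simple linear regression. If it involves two or more independent variables, and the dependent variable is linearly related to them, it is called multiple linear regression.
Basic regressors: linear regression, decision tree, SVM, KNN
Ensemble methods: random forest, AdaBoost, gradient boosting, bagging, extra trees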
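
To make the simple linear regression case concrete, here is a minimal sketch that fits a straight line with `LinearRegression`. The data (points on the line y = 2x + 1, with no noise) are a hypothetical example chosen so that the expected slope and intercept are known in advance.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data on the line y = 2x + 1
x = np.arange(10, dtype=float).reshape(-1, 1)   # one feature, ten samples
y = 2.0 * x.ravel() + 1.0

model = LinearRegression()
model.fit(x, y)

slope = model.coef_[0]
intercept = model.intercept_
print(slope, intercept)   # close to 2 and 1, up to floating-point error
```

With more than one column in `x`, the same call fits a multiple linear regression and `coef_` holds one coefficient per feature.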

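For the experiments, we simulate a function of two variables, $y = 0.5\sin(x_1) + 0.5\cos(x_2) + 0.1 x_1 + 3$, where $x_1$ ranges over [0, 50] and $x_2$ over [-10, 10]. The training set has 500 samples and the test set has 100. Uniform noise in the range -0.5 to 0.5 is added to the training targets. The code that generates the data is as follows: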

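########### 1. Data generation functions ##########
import numpy as np

def f(x1, x2):
    # Noise-free target function
    y = 0.5 * np.sin(x1) + 0.5 * np.cos(x2) + 0.1 * x1 + 3
    return y

# Build the training and test sets
def load_data():
    x1_train = np.linspace(0, 50, 500)
    x2_train = np.linspace(-10, 10, 500)
    # Add uniform noise in [-0.5, 0.5) to each training target.
    # A scalar draw (np.random.random()) keeps every row at exactly three numbers.
    data_train = np.array([[x1, x2, f(x1, x2) + (np.random.random() - 0.5)] for x1, x2 in zip(x1_train, x2_train)])
    # Jitter the test inputs slightly; the test targets themselves are noise-free
    x1_test = np.linspace(0, 50, 100) + 0.5 * np.random.random(100)
    x2_test = np.linspace(-10, 10, 100) + 0.02 * np.random.random(100)
    data_test = np.array([[x1, x2, f(x1, x2)] for x1, x2 in zip(x1_test, x2_test)])
    return data_train, data_test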

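%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.rcParams['figure.figsize'] = (12.0, 8.0)
train, test = load_data()
idx_train = range(0, 500)      # one x position per training sample
idx_test = range(0, 500, 5)    # 100 test samples spread over the same axis
plt.plot(idx_train, train[:,2], 'r-o')   # training targets (with noise)
plt.plot(idx_test, test[:,2], 'b-o');    # test targets (no noise)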

[Figure: training targets (red) and test targets (blue)]
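
The data generation above can also be sketched in vectorized form with a seeded random generator, so that repeated runs produce the same data. The function name `load_data_seeded` and the `seed` parameter are assumptions introduced for this sketch and are not part of the original code.

```python
import numpy as np

def f(x1, x2):
    # Same noise-free target function as in the text
    return 0.5 * np.sin(x1) + 0.5 * np.cos(x2) + 0.1 * x1 + 3

def load_data_seeded(seed=0):
    # `seed` and this function name are illustrative assumptions
    rng = np.random.default_rng(seed)

    # Training set: 500 samples, uniform noise in [-0.5, 0.5) on the targets
    x1_train = np.linspace(0, 50, 500)
    x2_train = np.linspace(-10, 10, 500)
    y_train = f(x1_train, x2_train) + rng.uniform(-0.5, 0.5, size=500)
    data_train = np.column_stack([x1_train, x2_train, y_train])

    # Test set: 100 samples, slightly jittered inputs, noise-free targets
    x1_test = np.linspace(0, 50, 100) + 0.5 * rng.random(100)
    x2_test = np.linspace(-10, 10, 100) + 0.02 * rng.random(100)
    data_test = np.column_stack([x1_test, x2_test, f(x1_test, x2_test)])
    return data_train, data_test

train, test = load_data_seeded(seed=0)
print(train.shape, test.shape)
```

Fixing the seed makes scores comparable across runs, which helps when comparing several regressors on the same data.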

2.2 Regression analysis

########### 1. Data generation ##########
train, test = load_data()
x_train, y_train = train[:,:2], train[:,2]  # first two columns are x1 and x2, third is y (noisy here)
x_test, y_test = test[:,:2], test[:,2]      # same layout, but y has no noise

########### 2. Regression ##########
def try_different_method(model):
    model.fit(x_train, y_train)            # fit on the given features and target
    score = model.score(x_test, y_test)    # R^2 on the test set, used as the evaluation metric
    result = model.predict(x_test)         # predicted values
    plt.figure()
    plt.plot(np.arange(len(result)), y_test, 'go-', label='true value')
    plt.plot(np.arange(len(result)), result, 'ro-', label='predict value')
    plt.title('score: %f' % score)
    plt.legend()
    plt.show()


########### 3. Choosing a specific method ##########
#### 3.1 Decision tree regression ####
from sklearn import tree
model_DecisionTreeRegressor = tree.DecisionTreeRegressor()
#### 3.2 Linear regression ####
from sklearn import linear_model
model_LinearRegression = linear_model.LinearRegression()
#### 3.3 SVM regression ####
from sklearn import svm
model_SVR = svm.SVR()
#### 3.4 KNN regression ####
from sklearn import neighbors
model_KNeighborsRegressor = neighbors.KNeighborsRegressor()
#### 3.5 Random forest regression ####
from sklearn import ensemble
model_RandomForestRegressor = ensemble.RandomForestRegressor(n_estimators=20)  # 20 decision trees
#### 3.6 AdaBoost regression ####
from sklearn import ensemble
model_AdaBoostRegressor = ensemble.AdaBoostRegressor(n_estimators=50)  # 50 decision trees
#### 3.7 GBRT regression ####
from sklearn import ensemble
model_GradientBoostingRegressor = ensemble.GradientBoostingRegressor(n_estimators=100)  # 100 decision trees
#### 3.8 Bagging regression ####
from sklearn.ensemble import BaggingRegressor
model_BaggingRegressor = BaggingRegressor()
#### 3.9 Extremely randomized tree regression (a single tree; the ensemble version is ensemble.ExtraTreesRegressor) ####
from sklearn.tree import ExtraTreeRegressor
model_ExtraTreeRegressor = ExtraTreeRegressor()

try_different_method(model_DecisionTreeRegressor)
#try_different_method(model_LinearRegression)
#try_different_method(model_SVR)
#try_different_method(model_KNeighborsRegressor)
#try_different_method(model_RandomForestRegressor)
#try_different_method(model_AdaBoostRegressor)
#try_different_method(model_GradientBoostingRegressor)
# try_different_method(model_BaggingRegressor)
#try_different_method(model_ExtraTreeRegressor)
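
Rather than uncommenting one call at a time, several models can be compared in a single loop. The sketch below is one way to do that under stated assumptions: it regenerates the same kind of data with a fixed seed, fits four of the regressors listed above, and prints each test-set R². The dictionary keys, the seed, and the `random_state` values are illustrative choices, not part of the original code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor

def f(x1, x2):
    return 0.5 * np.sin(x1) + 0.5 * np.cos(x2) + 0.1 * x1 + 3

# Same data layout as in the text, generated with a fixed seed (assumption)
rng = np.random.default_rng(0)
x1_tr, x2_tr = np.linspace(0, 50, 500), np.linspace(-10, 10, 500)
X_train = np.column_stack([x1_tr, x2_tr])
y_train = f(x1_tr, x2_tr) + rng.uniform(-0.5, 0.5, size=500)
x1_te = np.linspace(0, 50, 100) + 0.5 * rng.random(100)
x2_te = np.linspace(-10, 10, 100) + 0.02 * rng.random(100)
X_test = np.column_stack([x1_te, x2_te])
y_test = f(x1_te, x2_te)

models = {
    "linear": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(random_state=0),
    "knn": KNeighborsRegressor(),
    "random_forest": RandomForestRegressor(n_estimators=20, random_state=0),
}

# Fit every model on the same data and record its test-set R^2
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

# Print from best to worst
for name, s in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: R^2 = {s:.3f}")
```

Because every estimator exposes the same `fit` and `score` methods, adding another model only requires one more dictionary entry.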

[Figure: true values (green) and predicted values (red) on the test set, with the score in the title]
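
For scikit-learn regressors, the value returned by `model.score` is the coefficient of determination, R². As a minimal sketch of how it relates to the other common regression metrics (MSE, RMSE, MAE), the example below computes all four on a tiny set of made-up values and checks R² against its definition, 1 - SS_res / SS_tot. The numbers in `y_true` and `y_pred` are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Hypothetical true and predicted values
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

mse = mean_squared_error(y_true, y_pred)    # mean of squared errors
rmse = np.sqrt(mse)                         # square root of MSE, in the units of y
mae = mean_absolute_error(y_true, y_pred)   # mean of absolute errors
r2 = r2_score(y_true, y_pred)               # coefficient of determination

# R^2 from its definition: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mse, rmse, mae, r2, r2_manual)
```

Here the squared errors are 1, 0, and 4, so MSE is 5/3 and MAE is 1; with SS_res = 5 and SS_tot = 8, R² is 0.375.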

2.3 Case study: Boston housing price regression

# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this code needs an older version
from sklearn.datasets import load_boston   # loader for the housing price data
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

# 1 Prepare the data
# Load the Boston-area housing price data
boston = load_boston()
print(boston.DESCR)

.. _boston_dataset:

Boston house prices dataset

Data Set Characteristics:
:Number of Instances: 506   # number of samples

:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

:Attribute Information (in order):  # features that may affect housing prices
    - CRIM     per capita crime rate by town
    - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
    - INDUS    proportion of non-retail business acres per town
    - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
    - NOX      nitric oxides concentration (parts per 10 million)
    - RM       average number of rooms per dwelling
    - AGE      proportion of owner-occupied units built prior to 1940
    - DIS      weighted distances to five Boston employment centres
    - RAD      index of accessibility to radial highways
    - TAX      full-value property-tax rate per $10,000
    - PTRATIO  pupil-teacher ratio by town
    - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
    - LSTAT    % lower status of the population
    - MEDV     Median value of owner-occupied homes in $1000's

:Missing Attribute Values: None

:Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

This dataset was taken from the StatLib library which is maintained at
Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L.
‘Hedonic prices and the demand for clean air’, J. Environ. Economics &
Management, vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch,
‘Regression diagnostics …’, Wiley, 1980. N.B. Various
transformations are used in the table on pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning
papers that address regression problems.
.. topic:: References

Belsley, Kuh & Welsch, ‘Regression diagnostics: Identifying Influential Data and Sources of Collinearity’, Wiley, 1980. 244-261.

Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of
Machine Learning, 236-243, University of Massachusetts, Amherst.
Morgan Kaufmann.

# Data exploration
x = boston.data    # features (NumPy array)
y = boston.target  # target
# Look at the spread of the target values
print("Max price:", np.max(boston.target))    # 50
print("Min price:", np.min(boston.target))    # 5
print("Mean price:", np.mean(boston.target))  # 22.532806324110677

Max price: 50.0
Min price: 5.0
Mean price: 22.532806324110677

# 2 Split into training and test data
# Randomly sample 25% for testing and keep 75% for training
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=33)  # random_state: random seed
print(x_train.shape)
print(y_train.shape)
print(x_train[0,:</