Chapter 6: Building Models with scikit-learn

流光2021

Category: Python Data Analysis and Applications | Tags: data analysis, python

Article link: https://blog.csdn.net/qq_44913558/article/details/116201318

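This article walks through processing data with the scikit-learn library: loading datasets, splitting them into training and test sets, preprocessing, and dimensionality reduction. It then builds and evaluates clustering, classification, and regression models, demonstrating K-Means, SVM, and linear regression. Finally, two hands-on projects reinforce the model-building workflow by constructing and evaluating K-Means clustering and regression models on the wine and wine_quality datasets.

Summary generated automatically by CSDN.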

Task 6.1 Processing data with sklearn transformers

6.1.1 Loading datasets from the datasets module

# Load the breast_cancer dataset
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
print('Length of the breast_cancer dataset:', len(cancer))
print('Type of the breast_cancer dataset:', type(cancer))

Length of the breast_cancer dataset: 6
Type of the breast_cancer dataset: <class 'sklearn.utils.Bunch'>
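The length reported above is the number of keys in the returned object, not the number of samples. A minimal sketch of how to inspect the object, assuming only that `load_breast_cancer` returns a dict-like `Bunch` (the exact set of keys varies between scikit-learn versions):

```python
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# A Bunch behaves like a dict, so len() counts its keys, not its samples.
keys = list(cancer.keys())
print(keys)

# Each key is also exposed as an attribute; both forms return the same array.
same_object = cancer['data'] is cancer.data
print(same_object)

# The number of samples and features comes from the data array itself.
n_samples, n_features = cancer.data.shape
print(n_samples, n_features)
```

Checking `keys` first is a quick way to see which fields a given scikit-learn version provides before indexing into the dataset.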

# Access the contents of a built-in sklearn dataset
cancer_data = cancer['data']
print('Data of the breast_cancer dataset:\n', cancer_data)

Data of the breast_cancer dataset:
 [[1.799e+01 1.038e+01 1.228e+02 ... 2.654e-01 4.601e-01 1.189e-01]
 [2.057e+01 1.777e+01 1.329e+02 ... 1.860e-01 2.750e-01 8.902e-02]
 [1.969e+01 2.125e+01 1.300e+02 ... 2.430e-01 3.613e-01 8.758e-02]
 ...
 [1.660e+01 2.808e+01 1.083e+02 ... 1.418e-01 2.218e-01 7.820e-02]
 [2.060e+01 2.933e+01 1.401e+02 ... 2.650e-01 4.087e-01 1.240e-01]
 [7.760e+00 2.454e+01 4.792e+01 ... 0.000e+00 2.871e-01 7.039e-02]]

cancer_target = cancer['target']  # extract the labels
print('Labels of the breast_cancer dataset:\n', cancer_target)

Labels of the breast_cancer dataset:
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 1 0 0 1 1 1 1 0 1 0 0 1 1 1 1 0 1 0 0
 1 0 1 0 0 1 1 1 0 0 1 0 0 0 1 1 1 0 1 1 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1
 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0 1 0 1 0 0 1 0 0 1 1 0 1 1 0 1 1 1 1 0 1
 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 0 1 1 0 0 1 1 0 0 1 1 1 1 0 1 1 0 0 0 1 0
 1 0 1 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 1 1 0 1 0 0 0 0 1 1 0 0 1 1
 1 0 1 1 1 1 1 0 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0 1 1 1 1 1 0 1 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1
 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 0 0 0 1 1
 1 1 0 1 0 1 0 1 1 1 0 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0
 0 1 0 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1
 1 0 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 0 1 1 1 1 1 0 1 1
 0 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1
 1 1 1 1 1 1 0 1 0 1 1 0 1 1 1 1 1 0 0 1 0 1 0 1 1 1 1 1 0 1 1 0 1 0 1 0 0
 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 0 0 0 0 0 0 1]

cancer_names = cancer['feature_names']  # extract the feature names
print('Feature names of the breast_cancer dataset:\n', cancer_names)

Feature names of the breast_cancer dataset:
 ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']

cancer_desc = cancer['DESCR']  # extract the dataset description
print('Description of the breast_cancer dataset:\n', cancer_desc)

Description of the breast_cancer dataset:
 .. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Radius, field
        13 is Radius SE, field 23 is Worst Radius.

        - class:
                - WDBC-Malignant
                - WDBC-Benign

    :Summary Statistics:

    ===================================== ====== ======
                                           Min    Max
    ===================================== ====== ======
    radius (mean):                        6.981  28.11
    texture (mean):                       9.71   39.28
    perimeter (mean):                     43.79  188.5
    area (mean):                          143.5  2501.0
    smoothness (mean):                    0.053  0.163
    compactness (mean):                   0.019  0.345
    concavity (mean):                     0.0    0.427
    concave points (mean):                0.0    0.201
    symmetry (mean):                      0.106  0.304
    fractal dimension (mean):             0.05   0.097
    radius (standard error):              0.112  2.873
    texture (standard error):             0.36   4.885
    perimeter (standard error):           0.757  21.98
    area (standard error):                6.802  542.2
    smoothness (standard error):          0.002  0.031
    compactness (standard error):         0.002  0.135
    concavity (standard error):           0.0    0.396
    concave points (standard error):      0.0    0.053
    symmetry (standard error):            0.008  0.079
    fractal dimension (standard error):   0.001  0.03
    radius (worst):                       7.93   36.04
    texture (worst):                      12.02  49.54
    perimeter (worst):                    50.41  251.2
    area (worst):                         185.2  4254.0
    smoothness (worst):                   0.071  0.223
    compactness (worst):                  0.027  1.058
    concavity (worst):                    0.0    1.252
    concave points (worst):               0.0    0.291
    symmetry (worst):                     0.156  0.664
    fractal dimension (worst):            0.055  0.208
    ===================================== ====== ======

    :Missing Attribute Values: None

    :Class Distribution: 212 - Malignant, 357 - Benign

    :Creator:  Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian

    :Donor: Nick Street

    :Date: November, 1995

This is a copy of UCI ML Breast Cancer Wisconsin (Diagnostic) datasets.
https://goo.gl/U2Uwz2

Features are computed from a digitized image of a fine needle
aspirate (FNA) of a breast mass.  They describe
characteristics of the cell nuclei present in the image.

Separating plane described above was obtained using
Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree
Construction Via Linear Programming." Proceedings of the 4th
Midwest Artificial Intelligence and Cognitive Science Society,
pp. 97-101, 1992], a classification method which uses linear
programming to construct a decision tree.  Relevant features
were selected using an exhaustive search in the space of 1-4
features and 1-3 separating planes.

The actual linear program used to obtain the separating plane
in the 3-dimensional space is that described in:
[K. P. Bennett and O. L. Mangasarian: "Robust Linear
Programming Discrimination of Two Linearly Inseparable Sets",
Optimization Methods and Software 1, 1992, 23-34].

This database is also available through the UW CS ftp server:

ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WDBC/

.. topic:: References

   - W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction 
     for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on 
     Electronic Imaging: Science and Technology, volume 1905, pages 861-870,
     San Jose, CA, 1993.
   - O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and 
     prognosis via linear programming. Operations Research, 43(4), pages 570-577, 
     July-August 1995.
   - W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques
     to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994) 
     163-171.

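6.1.2 Splitting the dataset into training and test sets

# Check the shapes before splitting with train_test_split
print('Shape of the original data:', cancer_data.shape)
print('Shape of the original labels:', cancer_target.shape)

Shape of the original data: (569, 30)
Shape of the original labels: (569,)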

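from sklearn.model_selection import train_test_split
# Hold out 20% of the samples as the test set; random_state fixes the shuffle
cancer_data_train, cancer_data_test, cancer_target_train, cancer_target_test = train_test_split(
    cancer_data, cancer_target, test_size=0.2, random_state=42)
print('Shape of the training data:', cancer_data_train.shape)
print('Shape of the training labels:', cancer_target_train.shape)
print('Shape of the test data:', cancer_data_test.shape)
print('Shape of the test labels:', cancer_target_test.shape)

Shape of the training data: (455, 30)
Shape of the training labels: (455,)
Shape of the test data: (114, 30)
Shape of the test labels: (114,)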
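The split above is purely random. As a hedged variation that the original post does not use, the `stratify` parameter of `train_test_split` keeps the class proportions roughly equal in both parts, which can matter for imbalanced labels. The variable names below are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# stratify=y asks for the same class ratio in the training and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Labels are 0/1, so the mean is the fraction of class 1 in each part.
overall_frac = y.mean()
train_frac = y_train.mean()
test_frac = y_test.mean()
print(overall_frac, train_frac, test_frac)
```

The shapes are the same as in the unstratified split; only the assignment of samples to each part changes.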

6.1.3 Preprocessing and dimensionality reduction with sklearn transformers

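# Min-max normalization
import numpy as np
from sklearn.preprocessing import MinMaxScaler
Scaler = MinMaxScaler().fit(cancer_data_train)  # learn the rule from the training set only
# Apply the rule to the training set
cancer_trainScaler = Scaler.transform(cancer_data_train)
# Apply the same rule to the test set
cancer_testScaler = Scaler.transform(cancer_data_test)
print('Training set minimum before min-max normalization:', np.min(cancer_data_train))
print('Training set minimum after min-max normalization:', np.min(cancer_trainScaler))
print('Training set maximum before min-max normalization:', np.max(cancer_data_train))
print('Training set maximum after min-max normalization:', np.max(cancer_trainScaler))
print('Test set minimum before min-max normalization:', np.min(cancer_data_test))
print('Test set minimum after min-max normalization:', np.min(cancer_testScaler))
print('Test set maximum before min-max normalization:', np.max(cancer_data_test))
print('Test set maximum after min-max normalization:', np.max(cancer_testScaler))

Training set minimum before min-max normalization: 0.0
Training set minimum after min-max normalization: 0.0
Training set maximum before min-max normalization: 4254.0
Training set maximum after min-max normalization: 1.0000000000000002
Test set minimum before min-max normalization: 0.0
Test set minimum after min-max normalization: -0.057127602776294695
Test set maximum before min-max normalization: 3432.0
Test set maximum after min-max normalization: 1.3264399566986453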
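# PCA dimensionality reduction on the breast_cancer dataset
from sklearn.decomposition import PCA
# n_components controls how many components are kept:
#   - not set (None): all components are kept, i.e. min(n_samples, n_features)
#   - an int such as 10: the data is reduced to 10 dimensions
#   - a float in (0, 1): PCA keeps the smallest number of components whose
#     cumulative explained variance exceeds that fraction
#   - 'mle': the number of components is chosen automatically by an MLE method
pca_model = PCA(n_components=10).fit(cancer_trainScaler)  # learn the rule
# Apply the rule to the training set
cancer_trainPca = pca_model.transform(cancer_trainScaler)
# Apply the rule to the test set
cancer_testPca = pca_model.transform(cancer_testScaler)
print('Shape of the training set before PCA:', cancer_trainScaler.shape)
print('Shape of the training set after PCA:', cancer_trainPca.shape)
print('Shape of the test set before PCA:', cancer_testScaler.shape)
print('Shape of the test set after PCA:', cancer_testPca.shape)

Shape of the training set before PCA: (455, 30)
Shape of the training set after PCA: (455, 10)
Shape of the test set before PCA: (114, 30)
Shape of the test set after PCA: (114, 10)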
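To judge whether 10 components are enough, the fitted model's `explained_variance_ratio_` attribute can be inspected. The sketch below rebuilds the scaled training set so that it runs on its own, and also shows the float form of `n_components`; the 0.95 threshold is an arbitrary illustrative choice, not a value from the original post:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)
X_train_scaled = MinMaxScaler().fit_transform(X_train)

# Cumulative share of variance captured by the first k components.
pca_10 = PCA(n_components=10).fit(X_train_scaled)
cumulative = np.cumsum(pca_10.explained_variance_ratio_)
print(cumulative)

# A float in (0, 1) lets PCA pick the number of components itself.
pca_95 = PCA(n_components=0.95, svd_solver='full').fit(X_train_scaled)
print(pca_95.n_components_)
```

Reading `cumulative` from left to right shows how much variance each additional component contributes, which is one common way to choose `n_components`.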

6.1.4 Task implementation

1. Load the data

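# Load the boston dataset bundled with sklearn
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this snippet only runs on older versions.
from sklearn.datasets import load_boston
boston = load_boston()
boston_data = boston['data']
boston_target = boston['target']
boston_names = boston['feature_names']
print('Shape of the boston data:', boston_data.shape)
print('Shape of the boston labels:', boston_target.shape)
print('Shape of the boston feature names:', boston_names.shape)

Shape of the boston data: (506, 13)
Shape of the boston labels: (506,)
Shape of the boston feature names: (13,)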

2. Split the dataset into training and test sets

# Split the boston dataset with train_test_split
from sklearn.model_selection import train_test_split
boston_data_train, boston_data_test, boston_target_train, boston_target_test = train_test_split(
    boston_data, boston_target, test_size=0.2, random_state=42)
print('Shape of the training data:', boston_data_train.shape)
print('Shape of the training labels:', boston_target_train.shape)
print('Shape of the test data:', boston_data_test.shape)
print('Shape of the test labels:', boston_target_test.shape)

Shape of the training data: (404, 13)
Shape of the training labels: (404,)
Shape of the test data: (102, 13)
Shape of the test labels: (102,)

3. Preprocess the data with a transformer

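# Preprocess the data with stdScale.transform (z-score standardization)
import numpy as np
from sklearn.preprocessing import StandardScaler
stdScale = StandardScaler().fit(boston_data_train)  # learn the rule from the training set
# Apply the rule to the training set
boston_trainScaler = stdScale.transform(boston_data_train)
# Apply the rule to the test set
boston_testScaler = stdScale.transform(boston_data_test)
print('Variance of the training set after standardization:', np.var(boston_trainScaler))
print('Mean of the training set after standardization:', np.mean(boston_trainScaler))
print('Variance of the test set after standardization:', np.var(boston_testScaler))
print('Mean of the test set after standardization:', np.mean(boston_testScaler))

Variance of the training set after standardization: 1.0
Mean of the training set after standardization: 1.3637225393110834e-15
Variance of the test set after standardization: 0.9474773930196593
Mean of the test set after standardization: 0.030537934487192598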
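The figures above are computed over the whole matrix, while `StandardScaler` works feature by feature. A minimal sketch on synthetic data (random values standing in for the boston features, since `load_boston` is not available in recent scikit-learn versions) that checks the per-column statistics and reproduces the transform by hand:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: three features drawn from the same distribution.
rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
test = rng.normal(loc=5.0, scale=2.0, size=(50, 3))

scaler = StandardScaler().fit(train)
scaled_train = scaler.transform(train)
scaled_test = scaler.transform(test)

# Same rule by hand: subtract the training mean, divide by the training
# standard deviation (population form, ddof=0, which is NumPy's default).
manual_test = (test - train.mean(axis=0)) / train.std(axis=0)

# Per-feature statistics: exact for the training set, approximate for the test set.
print(scaled_train.mean(axis=0), scaled_train.std(axis=0))
print(scaled_test.mean(axis=0), scaled_test.std(axis=0))
```

The test set only comes close to zero mean and unit variance because its own statistics differ slightly from the training statistics used by the scaler.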

4. Reduce dimensionality with the PCA transformer

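# Reduce dimensionality with pca.transform
from sklearn.decomposition import PCA
# n_components controls how many components are kept:
#   - not set (None): all components are kept, i.e. min(n_samples, n_features)
#   - an int such as 5: the data is reduced to 5 dimensions
#   - a float in (0, 1): PCA keeps the smallest number of components whose
#     cumulative explained variance exceeds that fraction
#   - 'mle': the number of components is chosen automatically by an MLE method
pca_model = PCA(n_components=5).fit(boston_trainScaler)  # learn the rule
# Apply the rule to the training set
boston_trainPca = pca_model.transform(boston_trainScaler)
# Apply the rule to the test set
boston_testPca = pca_model.transform(boston_testScaler)
print('Shape of the training set before PCA:', boston_trainScaler.shape)
print('Shape of the training set after PCA:', boston_trainPca.shape)
print('Shape of the test set before PCA:', boston_testScaler.shape)
print('Shape of the test set after PCA:', boston_testPca.shape)

Shape of the training set before PCA: (404, 13)
Shape of the training set after PCA: (404, 5)
Shape of the test set before PCA: (102, 13)
Shape of the test set after PCA: (102, 5)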
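The standardize-then-reduce steps of this task can also be chained into a single object. The sketch below uses `make_pipeline`, which the original post does not use, and a synthetic array with the same shape as the boston data (an assumption made so the example runs without `load_boston`):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the boston data: (506, 13).
rng = np.random.default_rng(42)
X = rng.normal(size=(506, 13))
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# fit() learns the scaling and the PCA projection from the training set only;
# transform() then applies both steps in order.
pipeline = make_pipeline(StandardScaler(), PCA(n_components=5))
pipeline.fit(X_train)
train_reduced = pipeline.transform(X_train)
test_reduced = pipeline.transform(X_test)

print(train_reduced.shape, test_reduced.shape)
```

Bundling the steps this way makes it harder to accidentally fit a transformer on the test set, since the pipeline is fitted once and then only applied.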

Task 6.2 Building and evaluating clustering models

6.2.1 Building a clustering model with sklearn estimators

from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
iris = load_iris()
iris_data = iris['data']  # features
iris_target = iris['target']  # labels (not used for clustering)
iris_names = iris['feature_names']  # feature names
scale = MinMaxScaler().fit(iris_data)  # learn the rule
iris_dataScale = scale.transform(iris_data)  # apply the rule
# Build a K-Means model with 3 clusters; random_state makes the result reproducible
kmeans = KMeans(n_clusters=3, random_state=123).fit(iris_dataScale)
print('The fitted k-means model is:\n', kmeans)

The fitted k-means model is:
 KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=123, tol=0.0001, verbose=0)
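Once fitted, the estimator exposes the clustering result through its attributes and can assign new samples to a cluster. A minimal sketch, in which the new measurement `[1.5, 1.5, 1.5, 1.5]` is a hypothetical value and `n_init=10` is set explicitly (an addition to the original call) so the behaviour does not depend on the default of the installed scikit-learn version:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

iris = load_iris()
scaler = MinMaxScaler().fit(iris['data'])
iris_scaled = scaler.transform(iris['data'])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=123).fit(iris_scaled)

# Cluster index assigned to each training sample, and the cluster centres.
labels = kmeans.labels_
centers = kmeans.cluster_centers_
print(labels.shape, centers.shape)

# A new sample must go through the same scaler before prediction.
new_sample = scaler.transform([[1.5, 1.5, 1.5, 1.5]])  # hypothetical measurement
predicted = kmeans.predict(new_sample)
print(predicted)
```

The cluster indices are arbitrary identifiers, so they do not necessarily line up with the species labels in `iris['target']`.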