[Machine Learning] sklearn Datasets: Loading, Splitting, Classification, and Regression

Latest recommended article published on 2024-04-21 21:53:16

Nancy_尚


Tags: machine learning

Article link: https://blog.csdn.net/weixin_45334823/article/details/107185271


sklearn datasets

1. Dataset splitting
2. sklearn classification datasets
3. sklearn regression datasets

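1. Dataset splitting

In machine learning, a dataset is usually split into two parts:
Training data: used to train and build the model (classification, regression, or clustering)
Test data: used during model validation to evaluate whether the model is effective
A common split is 75% for training and 25% for testing.

sklearn API for splitting datasets: sklearn.model_selection.train_test_split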
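
A minimal sketch of the 75% / 25% split described above, using the iris dataset as sample input. The variable names and the fixed random_state are illustrative assumptions, not part of the original article; random_state is set only so that the split is repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Hold out 25% of the samples for testing; the remaining 75% are used for training.
# random_state is an arbitrary choice that makes the shuffle reproducible.
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0
)

print("train:", x_train.shape, y_train.shape)
print("test:", x_test.shape, y_test.shape)
```

train_test_split returns the pieces in the order training features, test features, training labels, test labels, so the unpacking order matters.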

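1.1 Obtaining data

There are two kinds of datasets: small ones bundled with datasets that can be loaded directly, and large-scale ones that must be downloaded.

sklearn.datasets
Loads and fetches popular datasets
datasets.load_*()
Loads a small dataset; the data ships inside datasets

datasets.fetch_*(data_home=None)
Fetches a large dataset that must be downloaded from the network. The data_home parameter specifies the directory the dataset is downloaded to; the default is ~/scikit_learn_data/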
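
The difference between the two loader families can be sketched as follows. load_digits and fetch_20newsgroups are picked here only as representative examples, and the helper fetch_news_training_set is a hypothetical name introduced for illustration. It is defined but not called, because calling it triggers a network download.

```python
from sklearn.datasets import load_digits


# load_*: the data ships with scikit-learn, so no network access is needed.
digits = load_digits()
print(digits.data.shape)


def fetch_news_training_set(data_home=None):
    """Hypothetical helper: download (or reuse a cached copy of) the 20 newsgroups data.

    data_home=None lets scikit-learn use its default cache directory.
    """
    # Imported here so that nothing network-related happens until the call.
    from sklearn.datasets import fetch_20newsgroups

    return fetch_20newsgroups(data_home=data_home, subset="train")
```

After the first successful download, the fetch_* functions reuse the files cached under data_home rather than downloading them again.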

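1.2 Return type of the loaders

load_* and fetch_* return a datasets.base.Bunch object, a dictionary-like container (exposed as sklearn.utils.Bunch in recent scikit-learn versions).

data: the feature array, a two-dimensional numpy.ndarray of shape [n_samples, n_features]

target: the label array, a one-dimensional numpy.ndarray of length n_samples

DESCR: a description of the dataset

feature_names: the feature names; not every dataset provides them (the news data, handwritten digits, and regression datasets may lack them, depending on the scikit-learn version)

target_names: the label names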
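
The fields listed above can be inspected directly. This is a small sketch using the iris dataset; the variable name iris is an illustrative choice.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# A Bunch behaves like a dict whose keys are also available as attributes.
print(list(iris.keys()))
print(iris["data"] is iris.data)

print(iris.data.shape)      # (n_samples, n_features)
print(iris.target.shape)    # (n_samples,)
print(iris.feature_names)   # one name per feature column
print(iris.target_names)    # one name per class label
```

Attribute access and key access return the same underlying array, so either style can be used consistently throughout a script.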

For example:

sklearn.datasets.load_iris() loads and returns the iris dataset.
Its feature data is an array with 150 rows and 4 columns. Here is how to load it:

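from sklearn.datasets import load_iris

# load_iris() returns a Bunch holding the data and its metadata
li = load_iris()
print("Feature values")
print(li.data)
print("Target values")
print(li.target)
# Dataset description, which appears at the end of the output
print(li.DESCR)

Here li is a datasets.base.Bunch object.
Running the code prints the following (the feature values are omitted here):

Target values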
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:
