Loading and Inspecting Datasets with sklearn

Sk8er-boi

Tags: python, deep learning, machine learning

Article link: https://blog.csdn.net/q2457374961/article/details/109198952

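Scikit-learn (sklearn) is a general-purpose machine learning library, and it is commonly used to load datasets as well.
Datasets usually come from one of two sources:
prepared by yourself
obtained from a third party
Scikit-learn is one of the SciKits. "SciKit" is short for "SciPy Toolkit", and the name comes from the SciPy library.
SciKits are add-on packages built on top of SciPy, and scikit-learn is only one of many such packages.
Scikit-learn is the one that focuses on machine learning and data mining.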

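Scikit-learn also ships with a few datasets that we can try loading.
First import the datasets module from sklearn, then call its load_digits() function to load the data.
The digits handwriting dataset contains 1797 samples; each sample consists of an 8*8 pixel image and an integer label in [0, 9].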

The sklearn.datasets module

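The sklearn.datasets module provides functions for loading bundled datasets, downloading datasets online, and generating datasets locally. You can list them with dir or help. They come in three main forms: datasets.load_*(), datasets.fetch_*(), and datasets.make_*(), where * stands for the dataset name.
① datasets.load_<dataset_name>(): small datasets bundled with the sklearn package
② datasets.fetch_<dataset_name>(): larger datasets, mainly used to test solutions to real-world problems; they are downloaded online
③ datasets.make_<dataset_name>(): generated (synthetic) datasets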
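
As a rough illustration of the three families above, the sketch below groups the public names in sklearn.datasets by prefix and then calls one generator. The choice of make_blobs and its parameter values are assumptions made for this example, not something the article prescribes.

```python
from sklearn import datasets

# Collect the public names exposed by sklearn.datasets.
names = [name for name in dir(datasets) if not name.startswith("_")]

# Group them by the three prefixes described above.
loaders = [name for name in names if name.startswith("load_")]
fetchers = [name for name in names if name.startswith("fetch_")]
makers = [name for name in names if name.startswith("make_")]

print("load_*: ", loaders)
print("fetch_*:", fetchers)
print("make_*: ", makers)

# Generate a small synthetic dataset (parameter values are illustrative).
X, y = datasets.make_blobs(n_samples=100, n_features=2, centers=3, random_state=0)
print(X.shape, y.shape)
```

The exact lists depend on the installed scikit-learn version, so the printed names may differ between environments.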

from sklearn import datasets  # import the datasets module from sklearn
data = datasets.load_digits()  # load the dataset
print(data)  # print the samples, the labels, and the metadata

{'data': array([[ 0.,  0.,  5., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ..., 10.,  0.,  0.],
       [ 0.,  0.,  0., ..., 16.,  9.,  0.],
       ...,
       [ 0.,  0.,  1., ...,  6.,  0.,  0.],
       [ 0.,  0.,  2., ..., 12.,  0.,  0.],
       [ 0.,  0., 10., ..., 12.,  1.,  0.]]), 'target': array([0, 1, 2, ..., 8, 9, 8]), 'target_names': array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), 'images': array([[[ 0.,  0.,  5., ...,  1.,  0.,  0.],
        [ 0.,  0., 13., ..., 15.,  5.,  0.],
        [ 0.,  3., 15., ..., 11.,  8.,  0.],
        ...,
        [ 0.,  4., 11., ..., 12.,  7.,  0.],
        [ 0.,  2., 14., ..., 12.,  0.,  0.],
        [ 0.,  0.,  6., ...,  0.,  0.,  0.]],

       [[ 0.,  0.,  0., ...,  5.,  0.,  0.],
        [ 0.,  0.,  0., ...,  9.,  0.,  0.],
        [ 0.,  0.,  3., ...,  6.,  0.,  0.],
        ...,
        [ 0.,  0.,  1., ...,  6.,  0.,  0.],
        [ 0.,  0.,  1., ...,  6.,  0.,  0.],
        [ 0.,  0.,  0., ..., 10.,  0.,  0.]],

       [[ 0.,  0.,  0., ..., 12.,  0.,  0.],
        [ 0.,  0.,  3., ..., 14.,  0.,  0.],
        [ 0.,  0.,  8., ..., 16.,  0.,  0.],
        ...,
        [ 0.,  9., 16., ...,  0.,  0.,  0.],
        [ 0.,  3., 13., ..., 11.,  5.,  0.],
        [ 0.,  0.,  0., ..., 16.,  9.,  0.]],

       ...,

       [[ 0.,  0.,  1., ...,  1.,  0.,  0.],
        [ 0.,  0., 13., ...,  2.,  1.,  0.],
        [ 0.,  0., 16., ..., 16.,  5.,  0.],
        ...,
        [ 0.,  0., 16., ..., 15.,  0.,  0.],
        [ 0.,  0., 15., ..., 16.,  0.,  0.],
        [ 0.,  0.,  2., ...,  6.,  0.,  0.]],

       [[ 0.,  0.,  2., ...,  0.,  0.,  0.],
        [ 0.,  0., 14., ..., 15.,  1.,  0.],
        [ 0.,  4., 16., ..., 16.,  7.,  0.],
        ...,
        [ 0.,  0.,  0., ..., 16.,  2.,  0.],
        [ 0.,  0.,  4., ..., 16.,  2.,  0.],
        [ 0.,  0.,  5., ..., 12.,  0.,  0.]],

       [[ 0.,  0., 10., ...,  1.,  0.,  0.],
        [ 0.,  2., 16., ...,  1.,  0.,  0.],
        [ 0.,  0., 15., ..., 15.,  0.,  0.],
        ...,
        [ 0.,  4., 16., ..., 16.,  6.,  0.],
        [ 0.,  8., 16., ..., 16.,  8.,  0.],
        [ 0.,  1.,  8., ..., 12.,  1.,  0.]]]), 'DESCR': ".. _digits_dataset:\n\nOptical recognition of handwritten digits dataset\n--------------------------------------------------\n\n**Data Set Characteristics:**\n\n    :Number of Instances: 5620\n    :Number of Attributes: 64\n    :Attribute Information: 8x8 image of integer pixels in the range 0..16.\n    :Missing Attribute Values: None\n    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)\n    :Date: July; 1998\n\nThis is a copy of the test set of the UCI ML hand-written digits datasets\nhttps://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits\n\nThe data set contains images of hand-written digits: 10 classes where\neach class refers to a digit.\n\nPreprocessing programs made available by NIST were used to extract\nnormalized bitmaps of handwritten digits from a preprinted form. From a\ntotal of 43 people, 30 contributed to the training set and different 13\nto the test set. 32x32 bitmaps are divided into nonoverlapping blocks of\n4x4 and the number of on pixels are counted in each block. This generates\nan input matrix of 8x8 where each element is an integer in the range\n0..16. This reduces dimensionality and gives invariance to small\ndistortions.\n\nFor info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.\nT. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.\nL. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,\n1994.\n\n.. topic:: References\n\n  - C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their\n    Applications to Handwritten Digit Recognition, MSc Thesis, Institute of\n    Graduate Studies in Science and Engineering, Bogazici University.\n  - E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.\n  - Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.\n    Linear dimensionalityreduction using relevance weighted LDA. 
School of\n    Electrical and Electronic Engineering Nanyang Technological University.\n    2005.\n  - Claudio Gentile. A New Approximate Maximal Margin Classification\n    Algorithm. NIPS. 2000."}

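To inspect the dataset, the following method and attributes are available:

keys(): lists the contents of the dataset

data: the sample data, a 2-D numpy.ndarray of shape n_samples * n_features
target: the label array, a 1-D numpy.ndarray of length n_samples
target_names: the label names
images: the sample data in image form (one 2-D array per sample)
DESCR: the description of the dataset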

data.keys()

dict_keys(['data', 'target', 'target_names', 'images', 'DESCR'])
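
The object returned by load_digits() behaves like a dictionary whose keys are also exposed as attributes. The following is a minimal sketch of both access styles, assuming a scikit-learn version in which load_digits accepts the return_X_y parameter. Newer releases may also report more keys than the output above.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# Dictionary-style and attribute-style access refer to the same arrays.
same_data = np.array_equal(digits["data"], digits.data)
same_target = np.array_equal(digits["target"], digits.target)
print(same_data, same_target)

# return_X_y=True skips the container and returns (data, target) directly.
X, y = datasets.load_digits(return_X_y=True)
print(X.shape, y.shape)
```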

print(data.data)  # view the sample data
print(data.data[0])
print(data.data.shape)  # 1797 samples with 64 features each (the pixel gray levels), so the shape is (1797, 64)

[[ 0.  0.  5. ...  0.  0.  0.]
 [ 0.  0.  0. ... 10.  0.  0.]
 [ 0.  0.  0. ... 16.  9.  0.]
 ...
 [ 0.  0.  1. ...  6.  0.  0.]
 [ 0.  0.  2. ... 12.  0.  0.]
 [ 0.  0. 10. ... 12.  1.  0.]]
[ 0.  0.  5. 13.  9.  1.  0.  0.  0.  0. 13. 15. 10. 15.  5.  0.  0.  3.
 15.  2.  0. 11.  8.  0.  0.  4. 12.  0.  0.  8.  8.  0.  0.  5.  8.  0.
  0.  9.  8.  0.  0.  4. 11.  0.  1. 12.  7.  0.  0.  2. 14.  5. 10. 12.
  0.  0.  0.  0.  6. 13. 10.  0.  0.  0.]
(1797, 64)

print(data.images)  # view the sample data in image (2-D) form
print(data.images[0])
print(data.images.shape)  # unlike data, each sample is kept as an 8x8 array

[[[ 0.  0.  5. ...  1.  0.  0.]
  [ 0.  0. 13. ... 15.  5.  0.]
  [ 0.  3. 15. ... 11.  8.  0.]
  ...
  [ 0.  4. 11. ... 12.  7.  0.]
  [ 0.  2. 14. ... 12.  0.  0.]
  [ 0.  0.  6. ...  0.  0.  0.]]

 [[ 0.  0.  0. ...  5.  0.  0.]
  [ 0.  0.  0. ...  9.  0.  0.]
  [ 0.  0.  3. ...  6.  0.  0.]
  ...
  [ 0.  0.  1. ...  6.  0.  0.]
  [ 0.  0.  1. ...  6.  0.  0.]
  [ 0.  0.  0. ... 10.  0.  0.]]

 [[ 0.  0.  0. ... 12.  0.  0.]
  [ 0.  0.  3. ... 14.  0.  0.]
  [ 0.  0.  8. ... 16.  0.  0.]
  ...
  [ 0.  9. 16. ...  0.  0.  0.]
  [ 0.  3. 13. ... 11.  5.  0.]
  [ 0.  0.  0. ... 16.  9.  0.]]

 ...

 [[ 0.  0.  1. ...  1.  0.  0.]
  [ 0.  0. 13. ...  2.  1.  0.]
  [ 0.  0. 16. ... 16.  5.  0.]
  ...
  [ 0.  0. 16. ... 15.  0.  0.]
  [ 0.  0. 15. ... 16.  0.  0.]
  [ 0.  0.  2. ...  6.  0.  0.]]

 [[ 0.  0.  2. ...  0.  0.  0.]
  [ 0.  0. 14. ... 15.  1.  0.]
  [ 0.  4. 16. ... 16.  7.  0.]
  ...
  [ 0.  0.  0. ... 16.  2.  0.]
  [ 0.  0.  4. ... 16.  2.  0.]
  [ 0.  0.  5. ... 12.  0.  0.]]

 [[ 0.  0. 10. ...  1.  0.  0.]
  [ 0.  2. 16. ...  1.  0.  0.]
  [ 0.  0. 15. ... 15.  0.  0.]
  ...
  [ 0.  4. 16. ... 16.  6.  0.]
  [ 0.  8. 16. ... 16.  8.  0.]
  [ 0.  1.  8. ... 12.  1.  0.]]]
[[ 0.  0.  5. 13.  9.  1.  0.  0.]
 [ 0.  0. 13. 15. 10. 15.  5.  0.]
 [ 0.  3. 15.  2.  0. 11.  8.  0.]
 [ 0.  4. 12.  0.  0.  8.  8.  0.]
 [ 0.  5.  8.  0.  0.  9.  8.  0.]
 [ 0.  4. 11.  0.  1. 12.  7.  0.]
 [ 0.  2. 14.  5. 10. 12.  0.  0.]
 [ 0.  0.  6. 13. 10.  0.  0.  0.]]
(1797, 8, 8)
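
The two shapes describe the same pixels: data stores each 8x8 image flattened into a row of 64 values. A short check of that relationship, written as a sketch with NumPy:

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# Flatten each 8x8 image into 64 values and compare with data.
flattened = digits.images.reshape(len(digits.images), -1)
print(flattened.shape)
matches = np.array_equal(flattened, digits.data)
print(matches)

# Going the other way: restore the first row of data to an 8x8 image.
first_image = digits.data[0].reshape(8, 8)
first_matches = np.array_equal(first_image, digits.images[0])
print(first_matches)
```

In practice this means data is the form to pass to an estimator, while images is the form to pass to a plotting function.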

print(data.target)  # the label array
print(data.target.shape)  # every sample has a corresponding label

[0 1 2 ... 8 9 8]
(1797,)

print(data.target_names)  # the label names

[0 1 2 3 4 5 6 7 8 9]
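
To see how the labels are distributed, one option is to count how often each value occurs in target. This sketch uses numpy.unique, which is a choice made here and not something the article relies on:

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# Count the number of samples for each distinct label.
labels, counts = np.unique(digits.target, return_counts=True)
for label, count in zip(labels, counts):
    print(label, count)
```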

The images can be viewed with matplotlib, Python's data visualization library.

# View the image at a given index
import matplotlib.pyplot as plt
import matplotlib
plt.imshow(data.images[0])  # index 0 is the first image
plt.show()
plt.imshow(data.images[0], cmap=matplotlib.cm.binary)  # cmap=matplotlib.cm.binary renders it in grayscale (white background, dark strokes)
plt.show()

Using a figure to lay out the images before displaying them

from sklearn import datasets
# Load the `digits` dataset
digits = datasets.load_digits()
# Import matplotlib
import matplotlib.pyplot as plt
# Set the figure size (width, height) in inches
fig = plt.figure(figsize=(6, 6))
# Adjust the subplot layout: margins and the spacing between subplots
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)
# For each of the first 64 images
for i in range(64):
    # Add a subplot at position i+1 of an 8x8 grid, without axis ticks
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    # Display the i-th image
    ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')
    # Label the image with its target value
    ax.text(0, 7, str(digits.target[i]))

# Show the figure
plt.show()

Example
Display the first 8 handwritten digit images in digits.images and label each one with its target value.

from sklearn import datasets
# Load the `digits` dataset
digits = datasets.load_digits()
# Import matplotlib
import matplotlib.pyplot as plt
# Pair each image with its target label in a list
images_and_labels = list(zip(digits.images, digits.target))
# For each of the first 8 items in the list
for index, (image, label) in enumerate(images_and_labels[:8]):
    # Create a subplot at position index+1 of a 2x4 grid
    plt.subplot(2, 4, index + 1)
    # Do not draw the axes
    plt.axis('off')
    # Display the image in the current subplot
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    # Add a title (the target label) to the subplot
    plt.title('Training: ' + str(label))
# Show the figure
plt.show()
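
When the script runs somewhere without a display, plt.show() may not open a window. As a hedged variant of the example above, the sketch below builds the same 2x4 grid with plt.subplots and saves it to a file instead. The Agg backend, the output file name digits_preview.png, and the temporary directory are assumptions chosen for this illustration.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no window is opened
import matplotlib.pyplot as plt
from sklearn import datasets

digits = datasets.load_digits()

# Create a 2x4 grid of axes in one call.
fig, axes = plt.subplots(2, 4, figsize=(8, 4))

# zip stops after the 8 axes, so only the first 8 images are drawn.
for ax, image, label in zip(axes.ravel(), digits.images, digits.target):
    ax.imshow(image, cmap="gray_r", interpolation="nearest")
    ax.set_title("Training: " + str(label))
    ax.axis("off")

# Hypothetical output location chosen for this example.
out_path = os.path.join(tempfile.gettempdir(), "digits_preview.png")
fig.savefig(out_path)
plt.close(fig)
print(out_path)
```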
