The digits handwritten digit dataset in sklearn

1. Import

from sklearn import datasets
digits = datasets.load_digits()

2. Inspecting attributes

  • digits: a Bunch object (a dict-like container)
print(digits.keys())

dict_keys(['data', 'target', 'target_names', 'images', 'DESCR'])
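
As a quick check of what those keys hold, the sketch below prints the shapes of the three main arrays. It assumes only that scikit-learn is installed; the variable names data_shape, images_shape and target_shape are made up for illustration.

```python
from sklearn import datasets

digits = datasets.load_digits()

# A Bunch supports both dict-style and attribute-style access.
same_object = digits["data"] is digits.data

# Shapes of the main arrays.
data_shape = digits.data.shape      # (n_samples, 64)
images_shape = digits.images.shape  # (n_samples, 8, 8)
target_shape = digits.target.shape  # (n_samples,)
print(data_shape, images_shape, target_shape)
```

Attribute access (digits.data) is what the rest of this post uses; the dict form is handy when looping over keys.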

3. The data in detail

  • 1797 samples; each sample consists of an 8*8-pixel image and an integer label in [0, 9]

3.1 images

  • ndarray holding the 8*8 images; the elements are float64, and there are 1797 images in total
  • Used for displaying the images:
    import matplotlib.pyplot as plt
    plt.imshow(digits.images[0])
    <matplotlib.image.AxesImage at 0x1676ca13ba8>
    plt.show()

 

# Get the first image
print(digits.images[0])
[[  0.   0.   5.  13.   9.   1.   0.   0.]
 [  0.   0.  13.  15.  10.  15.   5.   0.]
 [  0.   3.  15.   2.   0.  11.   8.   0.]
 [  0.   4.  12.   0.   0.   8.   8.   0.]
 [  0.   5.   8.   0.   0.   9.   8.   0.]
 [  0.   4.  11.   0.   1.  12.   7.   0.]
 [  0.   2.  14.   5.  10.  12.   0.   0.]
 [  0.   0.   6.  13.  10.   0.   0.   0.]]
  • Alternatively, show the figure through skimage:

    from skimage import io
    im = plt.imshow(digits.images[0])
    print(type(im))
    <class 'matplotlib.image.AxesImage'>
    io.show()

 

3.2 data

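  • ndarray; each image in images is flattened row by row into a single row of 64 values, giving 1797 rows in total
  • Serves as the input data (features)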
  • print(digits.data[0])
    [  0.   0.   5.  13.   9.   1.   0.   0.   0.   0.  13.  15.  10.  15.   5.
       0.   0.   3.  15.   2.   0.  11.   8.   0.   0.   4.  12.   0.   0.   8.
       8.   0.   0.   5.   8.   0.   0.   9.   8.   0.   0.   4.  11.   0.   1.
      12.   7.   0.   0.   2.  14.   5.  10.  12.   0.   0.   0.   0.   6.  13.
      10.   0.   0.   0.]
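
To make the relationship between images and data concrete, here is a small sketch that flattens images by hand and compares the result with data. The names flattened, same and first_image are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# Flatten each 8x8 image row by row into a 64-element vector.
flattened = digits.images.reshape(len(digits.images), -1)

# Compare the hand-flattened array with the data attribute.
same = np.array_equal(flattened, digits.data)
print(same)

# Going the other way: recover the first image from its feature row.
first_image = digits.data[0].reshape(8, 8)
```

Reshaping a row of data back to 8x8 is convenient when a model works on data but you still want to look at the sample it got wrong.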

3.3 target

  • ndarray giving the label of each image, that is, the digit the image represents
  • Serves as the output data (labels)
  • print(digits.target[0])
    
    0
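
Since data is the input and target is the output, the two are usually split together. The sketch below is one possible way to do that with train_test_split; the 80/20 ratio, random_state=0 and the stratify option are assumptions chosen for illustration, not something the dataset prescribes.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()

# counts[d] is the number of images labelled with digit d.
counts = np.bincount(digits.target)
print(counts)

# Split features and labels together so each row keeps its label.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data,
    digits.target,
    test_size=0.2,           # assumed ratio for this example
    random_state=0,          # fixed seed so the split is repeatable
    stratify=digits.target,  # keep class proportions similar in both parts
)
print(X_train.shape, X_test.shape)
```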

3.4 target_names

  • ndarray listing all label values that occur in the dataset
  • print(digits.target_names)
    [0 1 2 3 4 5 6 7 8 9]

3.5 DESCR

  • Description of the dataset: authors, data source, and so on
  • print(digits.DESCR)
    .. _digits_dataset:
    
    Optical recognition of handwritten digits dataset
    --------------------------------------------------
    
    **Data Set Characteristics:**
    
        :Number of Instances: 5620
        :Number of Attributes: 64
        :Attribute Information: 8x8 image of integer pixels in the range 0..16.
        :Missing Attribute Values: None
        :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
        :Date: July; 1998
    
    This is a copy of the test set of the UCI ML hand-written digits datasets
    http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
    
    The data set contains images of hand-written digits: 10 classes where
    each class refers to a digit.
    
    Preprocessing programs made available by NIST were used to extract
    normalized bitmaps of handwritten digits from a preprinted form. From a
    total of 43 people, 30 contributed to the training set and different 13
    to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
    4x4 and the number of on pixels are counted in each block. This generates
    an input matrix of 8x8 where each element is an integer in the range
    0..16. This reduces dimensionality and gives invariance to small
    distortions.
    
    For info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.
    T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.
    L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,
    1994.
    
    .. topic:: References
    
      - C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their
        Applications to Handwritten Digit Recognition, MSc Thesis, Institute of
        Graduate Studies in Science and Engineering, Bogazici University.
      - E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.
      - Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.
        Linear dimensionalityreduction using relevance weighted LDA. School of
        Electrical and Electronic Engineering Nanyang Technological University.
        2005.
      - Claudio Gentile. A New Approximate Maximal Margin Classification
        Algorithm. NIPS. 2000.
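
The description above says each 32x32 bitmap is divided into non-overlapping 4x4 blocks and the "on" pixels in each block are counted, which yields an 8x8 matrix with values in 0..16. A minimal NumPy sketch of that block counting follows. It is not the NIST preprocessing code; block_count is a hypothetical helper, and the input bitmap is random synthetic data rather than a real scan.

```python
import numpy as np

def block_count(bitmap, block=4):
    """Count 'on' pixels in each non-overlapping block x block tile."""
    h, w = bitmap.shape
    # Split rows and columns into (tile index, offset inside tile).
    tiles = bitmap.reshape(h // block, block, w // block, block)
    # Sum over the two "offset inside tile" axes.
    return tiles.sum(axis=(1, 3))

rng = np.random.default_rng(0)
bitmap = rng.integers(0, 2, size=(32, 32))  # synthetic binary image, not NIST data
small = block_count(bitmap)
print(small.shape)
```

Each output cell summarizes 16 input pixels, which is why the values range from 0 to 16 and why small shifts of a stroke change the result only slightly.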

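OK, let me answer your question.

K-means is a distance-based clustering algorithm. Its core idea is to partition the dataset into K clusters so that the points within each cluster are similar to one another. The algorithm proceeds roughly as follows:

1. Randomly initialize K cluster centers.
2. For each data point, compute its distance to each of the K centers and assign it to the cluster of the nearest one.
3. For each cluster, recompute its center.
4. Repeat steps 2 and 3 until the centers stop changing or the maximum number of iterations is reached.

Next, we can use the handwritten digits dataset that ships with sklearn to demonstrate K-means. First, import the required libraries:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score
```

Then load the dataset and split it into a training set and a test set:

```python
digits = load_digits()
data = digits.data
labels = digits.target
train_size = int(len(data) * 0.8)
train_data = data[:train_size]
train_labels = labels[:train_size]
test_data = data[train_size:]
test_labels = labels[train_size:]
```

Next, cluster the training set with K-means:

```python
kmeans = KMeans(n_clusters=10, random_state=0)
kmeans.fit(train_data)
```

Once the cluster centers are available, we can use them to classify the test set. Cluster indices are arbitrary and do not correspond to digit values, so each cluster is first mapped to the most common training label among its members; accuracy_score then computes the accuracy:

```python
# Map each cluster index to the majority training label inside that cluster
cluster_to_label = {c: np.bincount(train_labels[kmeans.labels_ == c]).argmax() for c in range(10)}
test_predictions = np.array([cluster_to_label[c] for c in kmeans.predict(test_data)])
accuracy = accuracy_score(test_labels, test_predictions)
print("Accuracy: {:.2f}%".format(accuracy * 100))
```

The accuracy obtained was about 74.44%, though the exact figure varies with the scikit-learn version and the initialization. Note that K-means is an unsupervised algorithm that never sees the labels while fitting, so accuracy is not its natural evaluation metric; it is used here only to demonstrate K-means on the handwritten digits dataset.

I hope this helps!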
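
The four steps listed in the answer can also be written out directly. The following is a bare-bones NumPy sketch for illustration, not how sklearn's KMeans is implemented (sklearn uses smarter initialization and several restarts). The function name kmeans, its parameters, and the two synthetic blobs are all assumptions made for this example.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Two well-separated synthetic blobs, 20 points each.
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(20, 2))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(20, 2))
points = np.vstack([blob_a, blob_b])

centers, labels = kmeans(points, k=2)
print(labels)
```

Which blob ends up as cluster 0 and which as cluster 1 depends on the random initialization, which is the same reason cluster indices on the digits data cannot be compared with digit labels directly.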