Digit Recognizer: A Kaggle Competition Series

Handwritten digit recognition

The data comes from the Kaggle competition Digit Recognizer.
The project report is hosted on my GitHub.
For an introduction to Kaggle, see the blog posts by wphh.


Case Study

Data Files

The data files train.csv and test.csv contain gray-scale images of hand-drawn digits, from zero through nine.

Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels. Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel-value is an integer between 0 and 255, inclusive.

Training Set

The training data set, (train.csv), has 785 columns. The first column, called “label”, is the digit that was drawn by the user. The rest of the columns contain the pixel-values of the associated image.

Each pixel column in the training set has a name like pixelx, where x is an integer between 0 and 783, inclusive. To locate this pixel on the image, suppose that we have decomposed x as x = i * 28 + j, where i and j are integers between 0 and 27, inclusive. Then pixelx is located on row i and column j of a 28 x 28 matrix, (indexing by zero).

For example, pixel31 indicates the pixel that is in the fourth column from the left, and the second row from the top, as in the ascii-diagram below.

Visually, if we omit the “pixel” prefix, the pixels make up the image like this:

000 001 002 003 ... 026 027
028 029 030 031 ... 054 055
056 057 058 059 ... 082 083
 |   |   |   |  ...  |   |
728 729 730 731 ... 754 755
756 757 758 759 ... 782 783
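
To make the index decomposition concrete, here is a minimal NumPy sketch (the helper name pixel_to_row_col is my own, not part of the competition description):

import numpy as np

# x = i * 28 + j, so divmod recovers (row i, column j), both zero-indexed
def pixel_to_row_col(x):
    return divmod(x, 28)

print(pixel_to_row_col(31))   # -> (1, 3): second row from the top, fourth column from the left

# Equivalently, a flat 784-value pixel vector reshapes into a 28 x 28 image
flat = np.arange(784)
image = flat.reshape(28, 28)
print(image[1, 3])            # -> 31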

Test Set

The test data set, (test.csv), is the same as the training set, except that it does not contain the “label” column.

Your submission file should be in the following format: For each of the 28000 images in the test set, output a single line with the digit you predict. For example, if you predict that the first image is of a 3, the second image is of a 7, and the third image is of an 8, then your submission file would look like:

3
7
8
(27997 more lines)

The evaluation metric for this contest is the categorization accuracy, or the proportion of test images that are correctly classified. For example, a categorization accuracy of 0.97 indicates that you have correctly classified all but 3% of the images.
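
The metric is simply the fraction of predictions that match the true labels; a minimal NumPy sketch (the arrays here are made-up examples):

import numpy as np

predicted = np.array([3, 7, 8, 1, 0])   # hypothetical predictions
actual    = np.array([3, 7, 9, 1, 0])   # hypothetical ground truth

accuracy = np.mean(predicted == actual)  # proportion of images classified correctly
print(accuracy)                          # -> 0.8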


Solution Report

I tried K-nearest neighbors, decision trees, random forests, and a few other methods; the best result in the end was

RandomForestClassifier(n_estimators=100), Accuracy = 0.96443

Principal component analysis was also used for dimensionality reduction to improve efficiency, but the results were not ideal.

The test results are as follows (a sketch for reproducing the comparison locally appears after the list):

  1. KNN: Accuracy = 0.83886
    The KNN algorithm spends a very long time in the prediction stage.

  2. IPCA + KNN: Accuracy = 0.84614
    IPCA dimensionality reduction can run into out-of-memory problems.
    Note: the machine used for testing has 8 GB of RAM.

  3. IPCA + RandomForest: Accuracy = 0.83843
    The random forest is clearly much more efficient than KNN; both training and prediction run very quickly.

  4. RandomForest: Accuracy = 0.96443
    The random forest gave the best result of the four approaches, although according to the competition forum,
    convolutional neural networks (deep learning) can reach Accuracy = 0.99+.
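
The scores above came from Kaggle submissions; the sketch below shows one way the same four configurations could be compared locally on a held-out split of train.csv, assuming scikit-learn (the function compare_locally and its parameters are my own; local scores will not match the leaderboard exactly):

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import IncrementalPCA
from sklearn.metrics import accuracy_score

def compare_locally(X, y):
    # Hold out 20% of the labeled training data for validation
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

    configs = [
        ("KNN",                 KNeighborsClassifier(),                   False),
        ("IPCA + KNN",          KNeighborsClassifier(),                   True),
        ("IPCA + RandomForest", RandomForestClassifier(n_estimators=100), True),
        ("RandomForest",        RandomForestClassifier(n_estimators=100), False),
    ]
    for name, clf, use_ipca in configs:
        A_tr, A_va = X_tr, X_va
        if use_ipca:
            # Fit IPCA on the training split only, then reuse it for validation
            ipca = IncrementalPCA(n_components=90, batch_size=100)
            A_tr = ipca.fit_transform(X_tr)
            A_va = ipca.transform(X_va)
        clf.fit(A_tr, y_tr)
        print(name, accuracy_score(y_va, clf.predict(A_va)))

With the data-loading code shown below, this would be called as compare_locally(trainX, trainY).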

The full code is as follows:

import csv
import numpy as np
from numpy import ravel
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import IncrementalPCA

################################################################################
# Data loading module

# Read the training data from train.csv
def csvTrainDataRead():
    print("Load train data...")
    features, labels = [], []
    with open('train.csv') as myCSV:
        reader = csv.reader(myCSV)
        for index, row in enumerate(reader):
            if index > 0:
                labels.append(row[:1])
                features.append(row[1:])

    # list -> array
    features = np.float_(features)
    labels = ravel(np.int_(labels))

    return features, labels

# Read the test data from test.csv
def csvTestDataRead():
    print("Load test data...")
    features = []
    with open('test.csv') as myCSV:
        reader = csv.reader(myCSV)
        for index, row in enumerate(reader):
            if index > 0:
                features.append(row)
    # list -> array.float
    features = np.float_(features)

    return features

################################################################################
# Save the predictions to a csv file
def csvResultDataSave(result, csvName):
    print('Saving predictions...')
    ids = np.arange(1, 28001)
    # newline='' avoids blank lines between rows when writing csv files on Python 3
    with open(csvName, 'w', newline='') as myCSV:
        myWriter = csv.writer(myCSV)
        myWriter.writerow(["ImageId", "Label"])
        myWriter.writerows(zip(ids, result))

################################################################################
# Classification module

# Classify the digits
def classificationDigit(X, Y, testX):

    '''
    print("Dimensionality reduction...")

    # Dimensionality reduction with IPCA
    # IPCA + KNN [Accuracy = 0.84614]
    # IPCA + RandomForest [Accuracy = 0.83843]
    ipca = IncrementalPCA(n_components=90, batch_size=100)
    X = ipca.fit_transform(X)
    # Reuse the components fitted on the training data; do not refit on the test set
    testX = ipca.transform(testX)

    print('explained variance ratio : %s, %f'
          % (str(ipca.explained_variance_ratio_),
             min(ipca.explained_variance_ratio_)))
    '''

    # Classifier: RandomForest [Accuracy = 0.96443]
    classifier = RandomForestClassifier(n_estimators=100)
    print('Training...')
    # Train
    classifier.fit(X, Y)
    print('Predicting...')
    predicted = classifier.predict(testX)

    return predicted


################################################################################
# Result comparison module (compares against Kaggle's rf_benchmark.csv)
def compareResult(result):
    print("Comparing... [first 1000 rows]")
    labels = []
    with open('rf_benchmark.csv') as myCSV:
        reader = csv.reader(myCSV)
        for index, row in enumerate(reader):
            if index > 0:
                labels.append(row[1:])
            if index > 1000:
                break

    # list -> array
    labels = ravel(np.int_(labels))

    ans = np.sum((labels[:1000]-result[:1000])==0)/float(len(labels[:1000]))

    print("Accuracy is %f\n" % ans)


################################################################################
# Main

trainX, trainY = csvTrainDataRead()
testX = csvTestDataRead()
result = classificationDigit(trainX, trainY, testX)
csvResultDataSave(result, 'rfc.csv')
compareResult(result)

Summary

First.
I had only just started with Python, and even reading and writing CSV files took me a long time to get right; I had installed Python 3.4, while most of the CSV examples online were written for Python 2.x, which caused quite a few problems.
Second.
For the machine learning methods I mainly used the scikit-learn toolkit; its official site has many examples, which makes it very convenient to learn from.
Third.
For image recognition, speech recognition, and natural language processing, deep learning methods work very well; in Python the nolearn toolkit is a good choice, and most of it is already compatible with scikit-learn (a small sketch follows below).
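
nolearn's own API is not shown in this post; as a rough stand-in for an sklearn-compatible neural network, scikit-learn's MLPClassifier (available from scikit-learn 0.18 onwards) follows the same fit/predict pattern as the random forest code above. This is a plain fully connected network, not the convolutional network mentioned on the forum:

from sklearn.neural_network import MLPClassifier

# A small fully connected network; pixel values are scaled to [0, 1] first
mlp = MLPClassifier(hidden_layer_sizes=(256,), max_iter=50, random_state=0)
mlp.fit(trainX / 255.0, trainY)
result = mlp.predict(testX / 255.0)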