Digit Recognizer: A Kaggle Competition Series

Handwritten digit recognition

The data comes from the Kaggle competition Digit Recognizer.
The project report is hosted on my GitHub.
For an introduction to Kaggle, see the blog posts by wphh.


Case Study

Data Files

The data files train.csv and test.csv contain gray-scale images of hand-drawn digits, from zero through nine.

Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels. Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel-value is an integer between 0 and 255, inclusive.

Training Set

The training data set, (train.csv), has 785 columns. The first column, called “label”, is the digit that was drawn by the user. The rest of the columns contain the pixel-values of the associated image.

Each pixel column in the training set has a name like pixelx, where x is an integer between 0 and 783, inclusive. To locate this pixel on the image, suppose that we have decomposed x as x = i * 28 + j, where i and j are integers between 0 and 27, inclusive. Then pixelx is located on row i and column j of a 28 x 28 matrix, (indexing by zero).

For example, pixel31 indicates the pixel that is in the fourth column from the left, and the second row from the top, as in the ascii-diagram below.

Visually, if we omit the “pixel” prefix, the pixels make up the image like this:

000 001 002 003 ... 026 027
028 029 030 031 ... 054 055
056 057 058 059 ... 082 083
 |   |   |   |  ...  |   |
728 729 730 731 ... 754 755
756 757 758 759 ... 782 783
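
To make the index decomposition concrete, here is a minimal NumPy sketch (the helper name pixel_to_row_col is my own, not part of the competition description):

import numpy as np

# x = i * 28 + j, so divmod recovers (row i, column j), both zero-indexed
def pixel_to_row_col(x):
    return divmod(x, 28)

print(pixel_to_row_col(31))   # -> (1, 3): second row from the top, fourth column from the left

# Equivalently, a flat 784-value pixel vector reshapes into a 28 x 28 image
flat = np.arange(784)
image = flat.reshape(28, 28)
print(image[1, 3])            # -> 31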

Test Set

The test data set, (test.csv), is the same as the training set, except that it does not contain the “label” column.

Your submission file should be in the following format: For each of the 28000 images in the test set, output a single line with the digit you predict. For example, if you predict that the first image is of a 3, the second image is of a 7, and the third image is of an 8, then your submission file would look like:

3
7
8
(27997 more lines)

The evaluation metric for this contest is the categorization accuracy, or the proportion of test images that are correctly classified. For example, a categorization accuracy of 0.97 indicates that you have correctly classified all but 3% of the images.
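
The metric is simply the fraction of predictions that match the true labels; a minimal NumPy sketch (the arrays here are made-up examples):

import numpy as np

predicted = np.array([3, 7, 8, 1, 0])   # hypothetical predictions
actual    = np.array([3, 7, 9, 1, 0])   # hypothetical ground truth

accuracy = np.mean(predicted == actual)  # proportion of images classified correctly
print(accuracy)                          # -> 0.8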


Solution Report

I tried K-nearest neighbors, decision trees, random forests, and a few other methods; the best result in the end was

RandomForestClassifier(n_estimators=100), Accuracy = 0.96443

Principal component analysis was also used for dimensionality reduction to improve efficiency, but the results were not ideal.

The test results are as follows (a sketch for reproducing the comparison locally appears after the list):

  1. KNN: Accuracy = 0.83886
    The KNN algorithm spends a very long time in the prediction stage.

  2. IPCA + KNN: Accuracy = 0.84614
    IPCA dimensionality reduction can run into out-of-memory problems.
    Note: the machine used for testing has 8 GB of RAM.

  3. IPCA + RandomForest: Accuracy = 0.83843
    The random forest is clearly much more efficient than KNN; both training and prediction run very quickly.

  4. RandomForest: Accuracy = 0.96443
    The random forest gave the best result of the four approaches, although according to the competition forum,
    convolutional neural networks (deep learning) can reach Accuracy = 0.99+.
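
The scores above came from Kaggle submissions; the sketch below shows one way the same four configurations could be compared locally on a held-out split of train.csv, assuming scikit-learn (the function compare_locally and its parameters are my own; local scores will not match the leaderboard exactly):

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import IncrementalPCA
from sklearn.metrics import accuracy_score

def compare_locally(X, y):
    # Hold out 20% of the labeled training data for validation
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

    configs = [
        ("KNN",                 KNeighborsClassifier(),                   False),
        ("IPCA + KNN",          KNeighborsClassifier(),                   True),
        ("IPCA + RandomForest", RandomForestClassifier(n_estimators=100), True),
        ("RandomForest",        RandomForestClassifier(n_estimators=100), False),
    ]
    for name, clf, use_ipca in configs:
        A_tr, A_va = X_tr, X_va
        if use_ipca:
            # Fit IPCA on the training split only, then reuse it for validation
            ipca = IncrementalPCA(n_components=90, batch_size=100)
            A_tr = ipca.fit_transform(X_tr)
            A_va = ipca.transform(X_va)
        clf.fit(A_tr, y_tr)
        print(name, accuracy_score(y_va, clf.predict(A_va)))

With the data-loading code shown below, this would be called as compare_locally(trainX, trainY).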

The full code is as follows:

import csv
import numpy as np
from numpy import ravel
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import IncrementalPCA

################################################################################
# Data loading module

# Read the training data from train.csv
def csvTrainDataRead():
    print("Load train data...")
    features, labels = [], []
    with open('train.csv') as myCSV:
        reader = csv.reader(myCSV)
        for index, row in enumerate(reader):
            if index > 0:
                labels.append(row[:1])
                features.append(row[1:])

    # list -> array
    features = np.float_(features)
    labels = ravel(np.int_(labels))

    return features, labels

# Read the test data from test.csv
def csvTestDataRead():
    print("Load test data...")
    features = []
    with open('test.csv') as myCSV:
        reader = csv.reader(myCSV)
        for index, row in enumerate(reader):
            if index > 0:
                features.append(row)
    # list -> array.float
    features = np.float_(features)

    return features

################################################################################
# Save the predictions to a csv file
def csvResultDataSave(result, csvName):
    print('Saving predictions...')
    ids = np.arange(1, 28001)
    # newline='' avoids blank lines between rows when writing csv files on Python 3
    with open(csvName, 'w', newline='') as myCSV:
        myWriter = csv.writer(myCSV)
        myWriter.writerow(["ImageId", "Label"])
        myWriter.writerows(zip(ids, result))

################################################################################
# Classification module

# Classify the digits
def classificationDigit(X, Y, testX):

    '''
    print("Dimensionality reduction...")

    # Dimensionality reduction with IPCA
    # IPCA + KNN [Accuracy = 0.84614]
    # IPCA + RandomForest [Accuracy = 0.83843]
    ipca = IncrementalPCA(n_components=90, batch_size=100)
    X = ipca.fit_transform(X)
    # Reuse the components fitted on the training data; do not refit on the test set
    testX = ipca.transform(testX)

    print('explained variance ratio : %s, %f'
          % (str(ipca.explained_variance_ratio_),
             min(ipca.explained_variance_ratio_)))
    '''

    # Classifier: RandomForest [Accuracy = 0.96443]
    classifier = RandomForestClassifier(n_estimators=100)
    print('Training...')
    # Train
    classifier.fit(X, Y)
    print('Predicting...')
    predicted = classifier.predict(testX)

    return predicted


################################################################################
# Result comparison module (compares against Kaggle's rf_benchmark.csv)
def compareResult(result):
    print("Comparing... [first 1000 rows]")
    labels = []
    with open('rf_benchmark.csv') as myCSV:
        reader = csv.reader(myCSV)
        for index, row in enumerate(reader):
            if index > 0:
                labels.append(row[1:])
            if index > 1000:
                break

    # list -> array
    labels = ravel(np.int_(labels))

    ans = np.sum((labels[:1000]-result[:1000])==0)/float(len(labels[:1000]))

    print("Accuracy is %f\n" % ans)


################################################################################
# Main

trainX, trainY = csvTrainDataRead()
testX = csvTestDataRead()
result = classificationDigit(trainX, trainY, testX)
csvResultDataSave(result, 'rfc.csv')
compareResult(result)

Summary

First.
I had only just started with Python, and even reading and writing CSV files took me a long time to get right; I had installed Python 3.4, while most of the CSV examples online were written for Python 2.x, which caused quite a few problems.
Second.
For the machine learning methods I mainly used the scikit-learn toolkit; its official site has many examples, which makes it very convenient to learn from.
Third.
For image recognition, speech recognition, and natural language processing, deep learning methods work very well; in Python the nolearn toolkit is a good choice, and most of it is already compatible with scikit-learn (a small sketch follows below).
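
nolearn's own API is not shown in this post; as a rough stand-in for an sklearn-compatible neural network, scikit-learn's MLPClassifier (available from scikit-learn 0.18 onwards) follows the same fit/predict pattern as the random forest code above. This is a plain fully connected network, not the convolutional network mentioned on the forum:

from sklearn.neural_network import MLPClassifier

# A small fully connected network; pixel values are scaled to [0, 1] first
mlp = MLPClassifier(hidden_layer_sizes=(256,), max_iter=50, random_state=0)
mlp.fit(trainX / 255.0, trainY)
result = mlp.predict(testX / 255.0)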