# Classify handwritten digits using the famous MNIST data

This competition is the first in a series of tutorial competitions designed to introduce people to Machine Learning.

The goal of this competition is to take an image of a single handwritten digit and determine what that digit is. As the competition progresses, we will release tutorials that explain different machine learning algorithms and help you get started.

The data for this competition were taken from the MNIST dataset. The MNIST ("Modified National Institute of Standards and Technology") dataset is a classic within the Machine Learning community that has been extensively studied. More detail about the dataset, including the Machine Learning algorithms that have been tried on it and their levels of success, can be found at http://yann.lecun.com/exdb/mnist/index.html.

In [1]:
pwd
C:\Users\zhaohf\Desktop

In [5]:
cd ../../../workspace/kaggle/DigitRecognizer/Data/
C:\workspace\kaggle\DigitRecognizer\Data

In [6]:
ls
 Volume in drive C is OS
 Volume Serial Number is 6C93-0DF3

 Directory of C:\workspace\kaggle\DigitRecognizer\Data

2015/01/15  16:04    <DIR>          .
2015/01/15  16:04    <DIR>          ..
2014/12/28  15:06           240,909 rf_benchmark.csv
2015/01/15  16:04        51,118,294 test.csv
2014/12/28  15:06        51,118,296 test.csv.bak
2014/12/28  15:06        76,775,041 train.csv
4 File(s)    179,252,540 bytes
2 Dir(s) 105,536,135,168 bytes free

In [7]:
import pandas as pd
df = pd.read_csv('train.csv',header=0).head() # keep only the first 5 rows
In [8]:
df
Out[8]:
label pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 ... pixel774 pixel775 pixel776 pixel777 pixel778 pixel779 pixel780 pixel781 pixel782 pixel783
0 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 4 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 785 columns

In [9]:
df['label']
Out[9]:
0    1
1    0
2    1
3    4
4    0
Name: label, dtype: int64
In [14]:
df = df.loc[:,'pixel0':] # drop the label column (.ix has been removed from pandas; .loc does the same label-based slice)
In [15]:
df
Out[15]:
pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 pixel9 ... pixel774 pixel775 pixel776 pixel777 pixel778 pixel779 pixel780 pixel781 pixel782 pixel783
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 784 columns

In [21]:
%matplotlib inline
import matplotlib.pyplot as plt
for i in range(df.shape[0]):
    img = df.iloc[i].values.reshape((28,28)) # 784 flat pixels -> 28x28 image (.iloc replaces the removed .ix)
    plt.subplot(2,5,i+1)
    plt.imshow(img)
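
The reshape in the cell above relies on each row storing its 28x28 image row by row, so that column `pixelK` sits at row `K // 28`, column `K % 28`. The following is a minimal sketch of that mapping on a synthetic row rather than the real data; the helper names `pixel_position` and `row_to_image` are made up for illustration and do not come from the notebook.

```python
import numpy as np

SIDE = 28  # images are 28x28, which gives the 784 pixel columns


def pixel_position(k, side=SIDE):
    """Return the (row, col) of flat column pixel<k> in a row-major image."""
    return divmod(k, side)


def row_to_image(pixels, side=SIDE):
    """Reshape one flat row of side*side pixel values into a 2-D array."""
    flat = np.asarray(pixels)
    if flat.size != side * side:
        raise ValueError("expected %d pixels, got %d" % (side * side, flat.size))
    return flat.reshape((side, side))


# Synthetic stand-in for one row of the pixel DataFrame:
# the value stored in each cell is simply its own flat index.
row = np.arange(SIDE * SIDE)
img = row_to_image(row)

# The last column, pixel783, lands in the bottom-right corner.
last_row, last_col = pixel_position(783)
```

With a real row, `row_to_image(df.iloc[i].values)` would produce the same array that the plotting loop passes to `plt.imshow`.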

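import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Assumption: `train` and `X_test` are not defined in the cells shown,
# so they are loaded here from the train.csv and test.csv files listed above.
train = pd.read_csv('train.csv', header=0).values
X_test = pd.read_csv('test.csv', header=0).values

# Column 0 is the label; the remaining 784 columns are the pixel values.
X_train = train[:, 1:]
print(X_train.shape)
Y_train = train[:, 0]
print(Y_train.shape)
print(X_test.shape)

print('Training...')
rf = RandomForestClassifier(n_estimators=100)
rf_model = rf.fit(X_train, Y_train)

print('Predicting...')
# Pair each prediction with a 1-based image index.
pred = [[index+1, x] for index, x in enumerate(rf_model.predict(X_test))]
print('Done.')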
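
The notebook stops once `pred` is built. A possible next step, sketched below, is writing those `[index, label]` pairs to a CSV file. The `ImageId,Label` header is an assumption about the expected submission layout, and `write_submission` is a hypothetical helper rather than part of the original code. The example uses a short hand-written list instead of real predictions.

```python
import csv
import io


def write_submission(pred, fileobj):
    """Write [image_id, label] pairs as CSV rows under an ImageId,Label header."""
    writer = csv.writer(fileobj, lineterminator="\n")
    writer.writerow(["ImageId", "Label"])  # assumed header, see the note above
    writer.writerows(pred)


# Hand-written stand-in for the `pred` list produced by the random forest.
example_pred = [[1, 2], [2, 0], [3, 9]]

# Write to an in-memory buffer so the example leaves no file behind.
buf = io.StringIO()
write_submission(example_pred, buf)
csv_text = buf.getvalue()
```

To write to disk instead, pass a file opened with `open('submission.csv', 'w', newline='')`; the file name here is only a placeholder.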