Digit Recognizer from Kaggle
link: https://www.kaggle.com/c/digit-recognizer
Digit Recognizer is one of the most basic competitions on Kaggle.
Dataset description:
The data files train.csv and test.csv contain gray-scale images of hand-drawn digits, from zero through nine.
Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels. Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel-value is an integer between 0 and 255, inclusive.
The training data set, (train.csv), has 785 columns. The first column, called “label”, is the digit that was drawn by the user. The rest of the columns contain the pixel-values of the associated image.
Each pixel column in the training set has a name like pixelx, where x is an integer between 0 and 783, inclusive. To locate this pixel on the image, suppose that we have decomposed x as x = i * 28 + j, where i and j are integers between 0 and 27, inclusive. Then pixelx is located on row i and column j of a 28 x 28 matrix, (indexing by zero).
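The index rule x = i * 28 + j can be checked with a few lines of Python. This is only a minimal sketch: the helper names `pixel_position` and `row_to_image` are made up for illustration and are not part of the competition files, and the row used here is synthetic rather than taken from train.csv.

```python
import numpy as np

IMAGE_SIDE = 28  # images are 28 x 28, per the dataset description


def pixel_position(x, side=IMAGE_SIDE):
    """Return (row, column) for the column named pixel<x>, using x = i * side + j."""
    if not 0 <= x < side * side:
        raise ValueError("pixel index out of range")
    return divmod(x, side)


def row_to_image(pixels, side=IMAGE_SIDE):
    """Reshape a flat sequence of side * side pixel values into a side x side array."""
    return np.asarray(pixels).reshape(side, side)


# Synthetic row (an assumption for the demo): each value equals its own index
flat = np.arange(IMAGE_SIDE * IMAGE_SIDE)
image = row_to_image(flat)

i, j = pixel_position(30)
print(i, j, image[i, j])
```

Because NumPy reshapes in row-major order by default, `image[i, j]` holds the value of pixel x = i * 28 + j, so pixel30 lands on row 1, column 2.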
First, take a look at the dataset.
# coding: utf-8
%matplotlib inline
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

def opencsv():  # load both CSV files with pandas
    data = pd.read_csv('data/train.csv')
    data1 = pd.read_csv('data/test.csv')
    train_data = data.values[0:, 1:]  # all training rows, every column after "label"
    train_label = data.values[0:, 0]  # first column holds the digit label
    test_data = data1.values[0:, 0:]  # all test rows and columns
    print('Data Load Done!')
    return train_data, train_label, test_data

train_data, train_label, test_data = opencsv()
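# train_data holds the 784 features of the training set, test_data holds the 784 features of the test set, and train_label holds the training labels.
# This makes it a typical supervised learning problem.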
Data Load Done!
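Before going further, it can help to confirm that the loaded arrays match the dataset description (784 pixel columns, pixel values in 0 to 255, labels in 0 to 9). The sketch below is one possible check, not part of the original notebook: `check_digit_arrays` is a hypothetical helper, and the demo arrays are random stand-ins rather than the real train.csv contents.

```python
import numpy as np


def check_digit_arrays(features, labels, n_pixels=784):
    """Raise ValueError if the arrays contradict the dataset description."""
    features = np.asarray(features)
    labels = np.asarray(labels)
    if features.ndim != 2 or features.shape[1] != n_pixels:
        raise ValueError("expected %d pixel columns" % n_pixels)
    if features.shape[0] != labels.shape[0]:
        raise ValueError("features and labels differ in length")
    if features.min() < 0 or features.max() > 255:
        raise ValueError("pixel values must lie in [0, 255]")
    if labels.min() < 0 or labels.max() > 9:
        raise ValueError("labels must be digits 0 through 9")
    return True


# Synthetic stand-in data (an assumption for the demo): 5 random "images"
rng = np.random.RandomState(0)
demo_features = rng.randint(0, 256, size=(5, 784))
demo_labels = rng.randint(0, 10, size=5)
print(check_digit_arrays(demo_features, demo_labels))
```

With the real data, the call would be `check_digit_arrays(train_data, train_label)`.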
import matplotlib.pyplot as plt
from numpy import *
print(shape(train_data), shape(test_data))  # the training set has 42000 samples; the test set has 28000
def showPic(data):
    plt.figure(figsize=(7,7))
    # show the first 70 images
    for digit_num in range(0,70):
        plt.subplot(7,10,digit_num+1)
        grid_data = data[digit_num].r