Kaggle: Digit Recognition

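This post describes the author's entry in Kaggle's Digit Recognition competition, which recognizes handwritten digits with the KNN algorithm. It covers data preprocessing and model training, explains the approach to implementing KNN, and provides two versions of the code: one that trains the model and one that predicts the test data.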

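This is a Kaggle competition about recognizing digits; see the official site for the full description. The data set is available there too. You can download it and run locally, or run on Kaggle's platform (remember to switch on the GPU under Settings).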

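I have written about handwritten-digit recognition before, in an earlier post that used logistic regression + softmax. There the data set sat in two folders with every sample saved as its own txt file, so loading the data was the tedious part.
The data set this time is much larger and stores pixel values, but the essence is the same: either form can serve as training features. This time I use KNN, written by hand rather than calling an existing package, so if you want to understand how the algorithm works, this post may help.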
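
As a rough illustration of the idea, here is a minimal KNN sketch on made-up 2D points instead of 784 pixel values. The names `euclidean` and `knn_classify` and the toy data are assumptions for this example only, not the code used in the post: measure the distance from the query to every training point, keep the k nearest, and let their labels vote.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two equally long feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    ranked = sorted(
        (euclidean(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    top_k = [label for _, label in ranked[:k]]
    return Counter(top_k).most_common(1)[0][0]

# Two well separated toy clusters, labelled "a" and "b".
train_points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_labels = ["a", "a", "a", "b", "b", "b"]

near_a = knn_classify(train_points, train_labels, (2, 2))
near_b = knn_classify(train_points, train_labels, (7, 9))
print(near_a, near_b)
```

With image data the only change is that each point has 784 coordinates, one per pixel.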

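I won't repeat the principle of KNN here; the ideas are all visible in the code. I started with a model that holds out part of the training data for testing (the script below uses test_size=0.25, i.e. 75% for training and 25% for testing) and got a prediction accuracy of 82%:
(screenshot)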
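
The hold-out evaluation can be sketched without any library as follows. `holdout_split` and `accuracy` are hypothetical helpers written for this illustration; they will not reproduce the exact shuffle of scikit-learn's `train_test_split`, only the same idea. The default ratio and seed mirror `test_size=0.25` and `random_state=33` in the script further down.

```python
import random

def holdout_split(samples, labels, test_ratio=0.25, seed=33):
    """Shuffle the indices once, then cut them into a test part and a train part."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(samples) * test_ratio))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train_x = [samples[i] for i in train_idx]
    test_x = [samples[i] for i in test_idx]
    train_y = [labels[i] for i in train_idx]
    test_y = [labels[i] for i in test_idx]
    return train_x, test_x, train_y, test_y

def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    hits = sum(1 for p, a in zip(predicted, actual) if p == a)
    return hits / len(actual)

# Toy data: 20 samples with alternating labels.
samples = [[i, i + 1] for i in range(20)]
labels = [i % 2 for i in range(20)]
train_x, test_x, train_y, test_y = holdout_split(samples, labels)
print(len(train_x), len(test_x))

score = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
print(score)
```

The test part is never used to find neighbours, so the accuracy measured on it is an estimate of how the model behaves on unseen digits.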
Then I ran it on the first 500 training rows and predicted only 100 rows of the test data, writing the output to predictions.csv. The result:
Figure 1

The run on the full data set was done on Kaggle. It ran for a long time and still had not produced a result.
Figure 2
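
One likely reason for the long run time is that every distance is computed element by element in Python loops. A possible speed-up, sketched here under the assumption that NumPy arrays are used, is to compute all test-to-train distances at once through the identity |a-b|² = |a|² + |b|² - 2a·b. The function name `pairwise_distances` and the toy arrays are made up for this example.

```python
import numpy as np

def pairwise_distances(test, train):
    """Euclidean distance from every test row to every train row, shape (m_test, m_train)."""
    test = np.asarray(test, dtype=np.float64)
    train = np.asarray(train, dtype=np.float64)
    squared = (
        np.sum(test ** 2, axis=1)[:, None]
        + np.sum(train ** 2, axis=1)[None, :]
        - 2.0 * test @ train.T
    )
    # Rounding can leave tiny negative values; clip them before the square root.
    return np.sqrt(np.maximum(squared, 0.0))

train = np.array([[0, 0], [3, 4], [6, 8]])
test = np.array([[0, 0], [6, 8]])

dists = pairwise_distances(test, train)
print(dists)

# Indices of the 2 nearest training rows for each test row.
nearest = np.argsort(dists, axis=1)[:, :2]
print(nearest)
```

For the full data set the distance matrix is large, so feeding the test rows through in batches would keep memory use bounded.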

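Here are both versions of the code.
The first one only trains and evaluates the model, without predicting the test set; its purpose is to see how good the model is: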

# -*- coding: utf-8 -*-
"""
Created on Fri May 11 20:38:13 2018

@author: xuanxuan
"""
from itertools import islice 
import numpy as np
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20

def load_data():
    filename="E:/pyhtonworkspace/py3-pratice/bymyself_practice/python_game/Data/Kaggel/digit-recognizer/digit-recognizer/train.csv"
    data=[]
    label=[]
    with open(filename) as file:   # the file is closed automatically
        for line in islice(file,1, None):  # skip the header row
            data_line=[int(num) for num in line.strip().split(',')]
            data.append(data_line[1:])   # remaining columns: pixel values
            label.append(data_line[0])   # first column: digit label
    data_mat,label_mat=np.mat(data),np.mat(label).T
    #print(data_mat.shape)  #(42000,784)
    #print(data_mat[:20])
    #print(label_mat.shape) #(42000,1)
    #print(label_mat[:20])
    #return data_mat,label_mat
    return data_mat[:500],label_mat[:500]  # run on the first 500 rows only; accuracy 82%

# Euclidean distance between one test sample and one training sample (each a 1 x n matrix row)
def cal_dis(test_data_line,train_data_line,n):
    dists=0
    for i in range(n):
        dists+=(test_data_line[0,i]-train_data_line[0,i])**2
    dists=np.sqrt(dists)
    return dists



def pred_test_label(d,k):
    d_arr=sorted(d,key=lambda x:x[0])  # sort the (distance, label) pairs by distance, ascending
    d_new={}   # votes among the k nearest neighbours, key-value: label-count
    for i in range(min(k,len(d_arr))):   # never read past the available neighbours
        key_label=d_arr[i][1]
        if key_label not in d_new:
            d_new[key_label]=1  # first occurrence counts as one vote
        else:
            d_new[key_label]+=1  # further occurrences add to the count
    d_new_arr=sorted(d_new.items(),key=lambda x:x[1],reverse=True)  # labels ordered by vote count, descending
    pred_label=d_new_arr[0][0]   # the final predicted label
    return pred_label


def knn(train_data,train_label,test_data,test_label,k=12):
    m_test,n=np.shape(test_data)
    m_train=np.shape(train_data)[0]

    num_error=0  # number of wrong predictions
    for i in range(m_test):
        # (distance, training label) pairs for test sample i; a list keeps neighbours
        # with equal distances, which a dict keyed by distance would overwrite
        d=[]
        for j in range(m_train):
            dists=cal_dis(test_data[i],train_data[j],n)
            d.append((dists,train_label[j,0]))

        # d now holds the distance from test sample i to every training sample

        pred_label=pred_test_label(d,k)
        if pred_label!=test_label[i,0]:
            num_error+=1
    accuracy=1-num_error/m_test
    return accuracy




if __name__=="__main__":
    data,label=load_data() 
    train_data,test_data,train_label,test_label=train_test_split(data,label,test_size=0.25,random_state=33)
    accuracy=knn(train_data,train_label,test_data,test_label)
    print("KNN prediction accuracy: {}".format(accuracy))
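
One pitfall is worth knowing when collecting neighbours: if the distance itself is used as a dictionary key, two training samples at exactly the same distance collapse into a single entry, and the vote can change. The sketch below uses made-up (distance, label) pairs and a hypothetical helper `vote` to show the effect. It is an illustration, not a measurement on the competition data.

```python
from collections import Counter

# Made-up neighbours; two of them share the distance 2.0.
neighbours = [(2.0, 3), (2.0, 3), (1.0, 5), (4.0, 5), (5.0, 7)]

# Keyed by distance: the second 2.0 entry overwrites the first.
by_distance = {}
for dist, label in neighbours:
    by_distance[dist] = label
print(len(neighbours), len(by_distance))

def vote(pairs, k):
    """Majority label among the k pairs with the smallest distance."""
    nearest = sorted(pairs, key=lambda pair: pair[0])[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

from_list = vote(neighbours, 3)            # neighbours 5, 3, 3 take part in the vote
from_dict = vote(by_distance.items(), 3)   # neighbours 5, 3, 5 take part in the vote
print(from_list, from_dict)
```

Keeping the pairs in a list preserves every neighbour, at the cost of sorting m_train pairs per test sample.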


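The second version trains on all the samples and predicts the test data (as posted, it is still limited to the first 500 training rows and the first 100 test rows):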

# -*- coding: utf-8 -*-
"""
Created on Fri May 11 20:38:13 2018

@author: xuanxuan
"""
from itertools import islice
import numpy as np

def load_data():
    filename="E:/pyhtonworkspace/py3-pratice/bymyself_practice/python_game/Data/Kaggel/digit-recognizer/digit-recognizer/train.csv"
    data=[]
    label=[]
    with open(filename) as file:   # the file is closed automatically
        for line in islice(file,1, None):  # skip the header row
            data_line=[int(num) for num in line.strip().split(',')]
            data.append(data_line[1:])   # remaining columns: pixel values
            label.append(data_line[0])   # first column: digit label
    data_mat,label_mat=np.mat(data),np.mat(label).T
    #print(data_mat.shape)  #(42000,784)
    #print(data_mat[:20])
    #return data_mat,label_mat
    return data_mat[:500] ,label_mat[:500]   # use only the first 500 training rows

def load_data_test():
    filename="E:/pyhtonworkspace/py3-pratice/bymyself_practice/python_game/Data/Kaggel/digit-recognizer/digit-recognizer/test.csv"
    data=[]
    with open(filename) as file:   # the file is closed automatically
        for line in islice(file,1, None):  # skip the header row
            data_line=[int(num) for num in line.strip().split(',')]
            data.append(data_line)   # test.csv has no label column; every value is a pixel
    data_mat=np.mat(data)
    #print(data_mat.shape)  #(28000,784)
    #print(data_mat[:20])
    #return data_mat
    return data_mat[:100]   # predict only the first 100 test rows

# Euclidean distance between one test sample and one training sample (each a 1 x n matrix row)
def cal_dis(test_data_line,train_data_line,n):
    dists=0
    for i in range(n):
        dists+=(test_data_line[0,i]-train_data_line[0,i])**2
    dists=np.sqrt(dists)
    return dists



def pred_test_label(d,k):
    d_arr=sorted(d,key=lambda x:x[0])  # sort the (distance, label) pairs by distance, ascending
    d_new={}   # votes among the k nearest neighbours, key-value: label-count
    for i in range(min(k,len(d_arr))):   # never read past the available neighbours
        key_label=d_arr[i][1]
        if key_label not in d_new:
            d_new[key_label]=1  # first occurrence counts as one vote
        else:
            d_new[key_label]+=1  # further occurrences add to the count
    d_new_arr=sorted(d_new.items(),key=lambda x:x[1],reverse=True)  # labels ordered by vote count, descending
    pred_label=d_new_arr[0][0]   # the final predicted label
    return pred_label


def knn(train_data,train_label,test_data,k=12):
    m_test,n=np.shape(test_data)
    m_train=np.shape(train_data)[0]
    predictions=[]  # predicted labels, in test-set order
    for i in range(m_test):
        # (distance, training label) pairs for test sample i; a list keeps neighbours
        # with equal distances, which a dict keyed by distance would overwrite
        d=[]
        for j in range(m_train):
            dists=cal_dis(test_data[i],train_data[j],n)
            d.append((dists,train_label[j,0]))

        # d now holds the distance from test sample i to every training sample

        pred_label=pred_test_label(d,k)
        predictions.append(pred_label)
    #print(predictions)
    return predictions

def get_result(predictions):
    # write the predictions in the ImageId,Label format; ImageId starts at 1
    out_path="E:/pyhtonworkspace/py3-pratice/bymyself_practice/python_game/Data/Kaggel/digit-recognizer/digit-recognizer/predictions.csv"
    with open(out_path, "w") as out_file:   # the file is closed even if a write fails
        out_file.write("ImageId,Label\n")
        for i in range(len(predictions)):
            out_file.write(str(i+1) + "," + str(int(predictions[i])) + "\n")

if __name__=="__main__":
    train_data,train_label=load_data()
    test_data=load_data_test()
    predictions=knn(train_data,train_label,test_data)
    get_result(predictions)
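
The output format can be checked without touching the disk. The sketch below writes the same ImageId,Label layout with Python's `csv` module into an in-memory buffer. `write_submission` and the three sample predictions are made up for illustration.

```python
import csv
import io

def write_submission(predictions, out):
    """Write one 'ImageId,Label' row per prediction; ImageId starts at 1."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["ImageId", "Label"])
    for image_id, label in enumerate(predictions, start=1):
        writer.writerow([image_id, int(label)])

buffer = io.StringIO()
write_submission([2, 0, 9], buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

Passing an open file instead of the buffer writes the same rows to disk. Note that a file holding only the first 100 predictions, as in the script above, does not cover every row of test.csv.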





