Kaggle Project in Practice 1: Digit Recognizer, Top 10% Ranking

KagglerWu

Article link: https://blog.csdn.net/u013691510/article/details/43195227

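1. Introduction to Kaggle

Kaggle is a crowdsourcing platform for data science problems and a good place to practice on real projects. Kaggle projects are divided into practice projects and prize projects. The Digit Recognizer project covered here is a practice project: the final ranking depends only on accuracy on the test set, and there is no prize. The Python code for the solution is open-sourced on GitHub.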

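2. The Digit Recognizer Task

The task is to train a digit classifier on MNIST (a labeled collection of digit images stored as pixels). The training set contains 42,000 training examples. Each example consists of 28*28=784 grayscale pixel values and a label from 0 to 9. The final ranking is based on classification accuracy on the test set.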
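
To make the data layout concrete, the sketch below builds one synthetic row in the same shape as a row of train.csv (a label followed by 784 pixel values) and reshapes the pixels into a 28*28 image. The row contents are random placeholder values, not real MNIST data, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# One synthetic row laid out like train.csv: [label, 784 pixel values].
# The values are random placeholders, not real MNIST pixels.
label = 7
pixels = rng.integers(0, 256, size=784)
row = np.concatenate(([label], pixels))

# Split the row back into its label and its 28x28 grayscale image.
row_label = int(row[0])
image = row[1:].reshape(28, 28)

print(row.shape)    # (785,)
print(image.shape)  # (28, 28)
```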

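3. Tool Setup

This project is done in Python and requires the following libraries: ipython, numpy, matplotlib, pandas, scikit-learn, and theano. On Windows, ipython, numpy, matplotlib, and pandas can all be installed through the Canopy integrated development environment, which also removes the need to configure environment variables. Canopy is available as a download, and students who register and verify their status can get the academic version. ipython provides an interactive development and debugging environment, numpy provides scientific computing, matplotlib handles plotting and data visualization, and pandas handles data conversion, import, and export. scikit-learn is a fairly mature machine learning library for Python, with extensive and well-organized documentation. theano is a deep learning library; to install it in Canopy, first install libpython and mingw from Canopy's Package Manager.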

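4. First Attempt: Random Forest

Since this is a classification problem, my first idea was to use a random forest. A random forest is essentially a bagging method whose base learners are decision tree classifiers that are deep and have few examples at each leaf. Because the trees are deep, each one has high variance, and bagging reduces that variance through two rounds of random sampling. First, it randomly samples training examples to build each decision tree independently. Second, when a tree chooses a split point, it randomly picks a subset of the features to consider. With these two sources of randomness, each decision tree memorizes only part of the training set rather than all of it, and the bias this introduces is traded off against the high variance of deep decision trees. The advantages of the random forest algorithm are: 1. it can handle binary features and scalar features at the same time; 2. it has few hyperparameters, and tuning them has little effect on the result; 3. its accuracy is high. Its most prominent drawback is that, because it is an ensemble, it is relatively slow.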
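
The two sources of randomness can be sketched with plain NumPy. This is only an illustration of the sampling, not scikit-learn's implementation: sampling with replacement is the usual bagging choice, and using the square root of the feature count as the subset size is an assumed, commonly used value rather than something stated in this post.

```python
import numpy as np

rng = np.random.default_rng(0)
n_examples, n_features = 42000, 784

# Randomness 1: each tree is built from a bootstrap sample, i.e. training
# examples drawn with replacement, so some repeat and others are left out.
bootstrap_idx = rng.integers(0, n_examples, size=n_examples)
unique_fraction = np.unique(bootstrap_idx).size / n_examples

# Randomness 2: at each split, only a random subset of features is considered.
# sqrt(n_features) is an assumed subset size for illustration.
n_candidates = int(np.sqrt(n_features))
candidate_features = rng.choice(n_features, size=n_candidates, replace=False)

print(f"fraction of distinct examples seen by one tree: {unique_fraction:.3f}")
print(f"features considered at one split: {n_candidates} of {n_features}")
```

With sampling of this kind, one tree sees roughly 63% of the distinct training examples (the fraction approaches 1 - 1/e for large training sets), which is the sense in which each tree memorizes only part of the training set.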

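In typical data mining problems, feature engineering is very important. In this example, however, the dataset is already a clean set of grayscale values arranged by pixel, so little preprocessing is needed. First, read in the data: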

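import csv
import numpy as np

path="./"#assumed data directory holding train.csv and test.csv; adjust as needed

def readCSVFile(file):
    #read a csv file and return its rows (without the header) as an int32 array
    rawData=[]
    with open(path+file,newline='') as csvFile:
        reader=csv.reader(csvFile)
        for line in reader:
            rawData.append(line)#train.csv has 42001 lines,the first line is header
    rawData.pop(0)#remove header
    intData=np.array(rawData).astype(np.int32)
    return intData

def loadTrainingData():
    intData=readCSVFile("train.csv")
    label=intData[:,0]#the first column is the label
    data=intData[:,1:]#the remaining 784 columns are pixel values
    data=np.where(data>0,1,0)#binarize: replace every positive pixel value with 1
    return data,label

def loadTestData():
    intData=readCSVFile("test.csv")#test.csv has no label column
    data=np.where(intData>0,1,0)
    return data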

Then train the classifier with scikit-learn's RandomForestClassifier:

from sklearn.ensemble import RandomForestClassifier

def handwritingClassTest():
    #load the binarized training and test data
    trainData,trainLabel=loadTrainingData()
    testData=loadTestData()
    #train the rf classifier on the whole training set
    clf=RandomForestClassifier(n_estimators=1000,min_samples_split=5)
    clf=clf.fit(trainData,trainLabel)
    #predict all test examples in one call; predict expects a 2-D array
    resultList=clf.predict(testData)#1-D array with one label per test example
    #saveResult is not shown here; it writes the predictions to a csv file
    saveResult(resultList)

The hyperparameters were chosen by cross-validation. The predictions were then written to a csv file and submitted. The accuracy was 96.3%, which is not too bad.
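
The post does not show how the cross-validation was run. One possible way to do it with scikit-learn is a grid search, sketched below. Everything specific here is an assumption for illustration: the candidate values in param_grid, the 3-fold split, and the use of scikit-learn's small built-in 8*8 digits dataset in place of the Kaggle train.csv so that the example runs quickly.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small built-in digits dataset (8x8 pixel images), used here only as a
# stand-in for the Kaggle training data.
X, y = load_digits(return_X_y=True)

# Hypothetical candidate values; the grid searched in the post is not given.
param_grid = {
    "n_estimators": [10, 50],
    "min_samples_split": [2, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                # 3-fold cross-validation
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(f"mean cross-validated accuracy: {search.best_score_:.3f}")
```

After fitting, search.best_estimator_ holds a classifier refit on all of X and y with the selected values, which could then be used to predict the test set.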

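5. Second Attempt: Multi-Layer Perceptron

Since this is an image problem, a neural network should perform well. To prevent overfitting, we first build a relatively shallow network classifier. Here the theano library is used to build a one-layer perceptron cascaded with a softmax classifier. To make the code reusable later, it is best to wrap the perceptron and the softmax classifier in separate classes. First, read the data into a DataFrame with pandas and split the 42,000 examples into 35,000 + 7,000: 35,000 are used for training and the other 7,000 form the validation set (used for early stopping to prevent overfitting, discussed in detail later).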

import numpy as np
import pandas as pd
import theano
import theano.tensor as T

debug_mode=False#set to True to run on a small subset of the data

def shared_dataset(data_xy,borrow=True):
    """
    Load the dataset into theano shared variables to speed up the
    calculation, e.g. floating-point computation on the GPU.
    """
    data_x,data_y=data_xy
    shared_x=theano.shared(np.asarray(data_x,dtype=theano.config.floatX),borrow=borrow)
    shared_y=theano.shared(np.asarray(data_y,dtype=theano.config.floatX),borrow=borrow)
    # When storing data on the GPU it has to be stored as floats,
    # therefore we will store the labels as ``floatX`` as well
    # (``shared_y`` does exactly that). But during our computations
    # we need them as ints (we use labels as index, and if they are
    # floats it doesn't make sense), therefore instead of returning
    # ``shared_y`` we will have to cast it to int. This little hack
    # lets us get around this issue.
    return shared_x,T.cast(shared_y,'int32')

def load_data(path):
    print('...loading data')
    train_df=pd.read_csv(path+'train.csv').fillna(0).astype(int)
    test_df=pd.read_csv(path+'test.csv').fillna(0).astype(int)
    #column 0 of train.csv is the label; pixel values are scaled to [0,1]
    if not debug_mode:
        train_set=[train_df.values[0:35000,1:]/255.0,train_df.values[0:35000,0]]
        valid_set=[train_df.values[35000:,1:]/255.0,train_df.values[35000:,0]]
    else:
        train_set=[train_df.values[0:3500,1:]/255.0,train_df.values[0:3500,0]]
        valid_set=[train_df.values[3500:4000,1:]/255.0,train_df.values[3500:4000,0]]
    test_set=test_df.values/255.0
    test_set_x=theano.shared(np.asarray(test_set,dtype=theano.config.floatX),borrow=True)
    valid_set_x,valid_set_y=shared_dataset(valid_set,borrow=True)
    train_set_x,train_set_y=shared_dataset(train_set,borrow=True)
    rval=[(train_set_x,train_set_y),(valid_set_x,valid_set_y),test_set_x]
    return rval
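
The network described above (a perceptron layer followed by a softmax classifier) can be sketched as a forward pass in plain NumPy rather than theano. This is not the post's implementation: reading "one-layer" as one hidden layer, the hidden size of 500, the tanh activation, the random weight initialization, and all names are assumptions chosen for illustration, and no training is performed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 784, 500, 10  # n_hidden is an assumed size

# Randomly initialized parameters (untrained, for illustration only).
W_hidden = rng.uniform(-0.05, 0.05, size=(n_in, n_hidden))
b_hidden = np.zeros(n_hidden)
W_out = rng.uniform(-0.05, 0.05, size=(n_hidden, n_out))
b_out = np.zeros(n_out)

def forward(x):
    """Return class probabilities for a batch x of shape (batch, 784)."""
    hidden = np.tanh(x @ W_hidden + b_hidden)            # perceptron (hidden) layer
    logits = hidden @ W_out + b_out                      # input to the softmax layer
    logits = logits - logits.max(axis=1, keepdims=True)  # stabilize the exponentials
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)          # softmax

# A fake batch of 5 images with pixel values scaled to [0, 1], as in load_data.
batch = rng.integers(0, 256, size=(5, n_in)) / 255.0
probs = forward(batch)
predictions = probs.argmax(axis=1)  # predicted digit for each image

print(probs.shape)        # (5, 10)
print(predictions.shape)  # (5,)
```

Each row of probs is a probability distribution over the ten digits, and the predicted label is the index of the largest entry.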