Kaggle Digit Recognizer: An Analysis of Different Algorithms

Latest recommended article published on 2019-09-11 15:11:21

无垠无知

Categories: DL, ML; Kaggle

Article link: https://blog.csdn.net/weixin_43319488/article/details/83508708

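The dataset I used comes from the Kaggle competition Digit Recognizer. I applied naive Bayes, KNN, k-means, SVM, decision tree, and random forest, and compared their accuracy with and without PCA dimensionality reduction.
The experimental steps are as follows: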

Read the data
Preprocess the data
Select the models
Optimize the model

Without PCA

(The code below is the PCA version; comment out the get_PCA call to run the version without PCA.)

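import os
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_selection import VarianceThreshold

size_img = 28  # each image is size_img x size_img pixels

# pixels at or above this normalized intensity count as part of the digit
threshold_color = 100/255
# read_csv opens and closes the file itself, so no file handle is left open
data_train = pd.read_csv('train.csv')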

y_train = np.array(data_train.iloc[:,0])
x_train = np.array(data_train.iloc[:,1:])


data_test = pd.read_csv('test.csv')
x_test = np.array(data_test)

n_features_train = x_train.shape[1]
n_samples_train = x_train.shape[0]
n_features_test = x_test.shape[1]
n_samples_test = x_test.shape[0]
print(n_features_train,n_samples_train,n_features_test,n_samples_test)
print(x_train.shape, y_train.shape, x_test.shape)

"""
show the image 
"""
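def show_img(x):
    plt.figure(figsize=(8, 7))
    if x.ndim == 2 and x.shape[1] == size_img ** 2:
        # x is a batch of flattened images, shape (n_samples, size_img**2):
        # show up to the first 16 of them in a 4x4 grid
        n_imgs = min(16, x.shape[0])
        x = x.reshape(-1, size_img, size_img)
        for i in range(n_imgs):
            plt.subplot(4, 4, i + 1)
            plt.imshow(x[i])
        plt.show()
    else:
        # x is a single image already shaped (height, width)
        plt.imshow(x)
        plt.show()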

#Normalized
def int2float_grey(x):
    x = x/255
    return x
# resize and Cropping
def find_left_edge(x):
    edge_left = []
    n_samples = x.shape[0]
    for k in range(n_samples):
        for j in range(size_img):
            for i in range(size_img):
                if x[k,size_img*i +j] >= threshold_color:
                    edge_left.append(j)
                    break
            if len(edge_left) >k :
                break
    return edge_left

def find_right_edge(x):
    edge_right = []
    n_samples = x.shape[0]
    for k in range(n_samples):
        for j in range(size_img):
            for i in range(size_img):
                if x[k,size_img*i +(size_img -j-1)] >= threshold_color:
                    edge_right.append(size_img - 1- j)
                    break
            if len(edge_right) >k :
                break
    return edge_right


def find_top_edge(x):
    edge_top = []
    n_samples = x.shape[0]
    for k in range(n_samples):
        # scan rows from the top; the first row with a lit pixel is the top edge
        for i in range(size_img):
            for j in range(size_img):
                if x[k,size_img*i +j] >= threshold_color:
                    edge_top.append(i)
                    break
            if len(edge_top) >k :
                break
    return edge_top

def find_bottom_edge(x):
    edge_bottom = []
    n_samples = x.shape[0]
    for k in range(n_samples):
        # scan rows from the bottom; the first row with a lit pixel is the bottom edge
        for i in range(size_img):
            for j in range(size_img):
                if x[k,size_img*(size_img-1 -i) +j] >= threshold_color:
                    edge_bottom.append(size_img-1 -i)
                    break
            if len(edge_bottom) >k :
                break
    return edge_bottom


from skimage import transform
def stretch_image(x):
    # Crop each digit to its bounding box, then resize it back to size_img x size_img.
    edge_left = find_left_edge(x)
    edge_right = find_right_edge(x)
    edge_top  = find_top_edge(x)
    edge_bottom = find_bottom_edge(x)
    n_samples = x.shape[0]
    x = x.reshape(n_samples, size_img,size_img)

    for i in range(n_samples):
        # assumes every image has at least one pixel >= threshold_color
        x[i] = transform.resize(x[i][edge_top[i]:edge_bottom[i]+1,edge_left[i]:edge_right[i]+1],(size_img,size_img))
    x = x.reshape(n_samples,size_img **2)
    show_img(x)
    return x

# feature selection
def get_threshold(x_train,x_test):
    selector = VarianceThreshold(threshold= 0).fit(x_train)
    x_train = selector.transform(x_train)
    x_test = selector.transform(x_test)
    print("x_train.shape:",x_train.shape)
    print("x_test.shape:", x_test.shape)
    return x_train, x_test


from sklearn.decomposition import  PCA
def get_PCA(x_train,x_test):
    pca = PCA(n_components=0.95)
    pca.fit(x_train)
    x_train = pca.transform(x_train)
    x_test = pca.transform(x_test)
    return x_train,x_test


x_train = int2float_grey(x_train)
x_test = int2float_grey(x_test)
# stretch_image(x_train)
# stretch_image(x_test)
#x_train,x_test = get_threshold(x_train,x_test)
x_train,x_test = get_PCA(x_train,x_test)  #use pca


def general_function(mod_name, model_name):
    y_pred = model_train_predict(mod_name, model_name)
    out_predict(y_pred,model_name)


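import importlib
from sklearn.model_selection import cross_val_score
def model_train_predict(mod_name, model_name):
    # Look up the estimator class from its module path and class name.
    import_mod = importlib.import_module(mod_name)
    if not hasattr(import_mod, model_name):
        raise AttributeError("%s has no attribute %s" % (mod_name, model_name))
    f = getattr(import_mod, model_name)

    start_time = time.time()
    clf = f()  # default hyperparameters
    clf.fit(x_train,y_train)
    y_predict = clf.predict(x_train)
    end_time = time.time()
    print("mod_name.model_name is:",mod_name,model_name)
    # The timing covers fitting plus prediction on the training set.
    print("fit + train-set predict time is:", end_time - start_time)
    get_acc(y_predict,y_train)
    # For classifiers the score is accuracy; estimators such as KMeans
    # fall back to their own score() method instead.
    scores = cross_val_score(clf,x_train,y_train,cv=5)
    print("Accuracy: %0.2f (+- %0.2f)"%(scores.mean(),scores.std()*2))

    y_predict = clf.predict(x_test)
    return y_predict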

# get accuracy
def get_acc(y_pred,y_train):
    right_num = (y_train==y_pred).sum()
    print("acc:", right_num/len(y_train))

#output the csv
def out_predict(y_pred,model_name):
    print(y_pred)
    data_pred = {'ImageId':range(1,n_samples_test +1),'Label': y_pred}
    data_pred = pd.DataFrame(data_pred)
    data_pred.to_csv("Pre_%s.csv"%model_name,index=False)


from  sklearn.naive_bayes import  GaussianNB, MultinomialNB,BernoulliNB
mod_name = "sklearn.naive_bayes"
model_name = "GaussianNB"
general_function(mod_name,model_name)

# model_name = "MultinomialNB"  # cannot be used with PCA, because the input must be non-negative
# general_function(mod_name,model_name)
#
# model_name = "BernoulliNB"
# general_function(mod_name,model_name)

from sklearn.svm import SVC
mod_name = "sklearn.svm"
model_name = "SVC"

general_function(mod_name,model_name)

from sklearn.neighbors import  KNeighborsClassifier

mod_name = "sklearn.neighbors"
model_name = "KNeighborsClassifier"
general_function(mod_name,model_name)

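# KMeans is unsupervised: fit() ignores y_train, predict() returns arbitrary
# cluster indices (8 clusters by default), and cross_val_score reports
# KMeans.score (negative inertia) rather than accuracy.
from sklearn.cluster import KMeans
mod_name = "sklearn.cluster"
model_name = "KMeans"
general_function(mod_name,model_name)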

from sklearn.tree import  DecisionTreeClassifier
mod_name = "sklearn.tree"
model_name = "DecisionTreeClassifier"
general_function(mod_name,model_name)

# pca
from sklearn.ensemble import  RandomForestClassifier
mod_name = "sklearn.ensemble"
model_name = "RandomForestClassifier"
general_function(mod_name,model_name)
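
The edge-finding helpers in the listing above scan every image pixel by pixel in Python loops. As a rough alternative, the following sketch finds the same bounding box with NumPy boolean masks. The function name `bounding_box` and the synthetic test image are illustrative assumptions, not part of the original code.

```python
import numpy as np

def bounding_box(img, threshold):
    """Return (top, bottom, left, right), inclusive, of pixels >= threshold.

    Returns None when no pixel reaches the threshold (a blank image).
    """
    mask = img >= threshold
    rows = np.flatnonzero(mask.any(axis=1))  # rows containing a lit pixel
    cols = np.flatnonzero(mask.any(axis=0))  # columns containing a lit pixel
    if rows.size == 0:
        return None
    return int(rows[0]), int(rows[-1]), int(cols[0]), int(cols[-1])

# Synthetic 28x28 image with a filled rectangle standing in for a digit.
img = np.zeros((28, 28))
img[5:20, 8:15] = 1.0
box = bounding_box(img, 100 / 255)
print(box)  # (5, 19, 8, 14)
```

Handling the blank-image case explicitly avoids the index misalignment that the loop-based helpers would run into if an image had no pixel above the threshold.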

The results without PCA are as follows (only some of the models were validated on the test set):

Naive Bayes
GaussianNB: training-set accuracy 0.55719047619, time 17.86 seconds; k-fold accuracy 0.56 (± 0.01).
MultinomialNB: training-set accuracy 0.82480952381, time 2.06 seconds; k-fold accuracy 0.82 (± 0.01).
BernoulliNB: training-set accuracy 0.834785714286, time 1.92 seconds; k-fold accuracy 0.83 (± 0.01); test-set accuracy 0.83242.
SVM
SVM: training-set accuracy 0.9405, time 15.45 minutes; k-fold accuracy 0.93 (± 0.01); test-set accuracy 0.93600.
KNN
KNN: training-set accuracy 0.97142857143, time 38.07 minutes; k-fold accuracy 0.97 (± 0.00); test-set accuracy 0.97000.
Kmeans
Kmeans: training-set accuracy 0.14830952381, time 1.877 minutes; k-fold score -340919.55 (± 1785.45); test-set accuracy 0.15028. The k-fold figure is not an accuracy: for KMeans, cross_val_score reports KMeans.score, the negative within-cluster sum of squared distances.
Decision Tree
DecisionTree: training-set accuracy 1.0, time 12.84 seconds; k-fold accuracy 0.85 (± 0.00).
Random Forest
Random Forest: training-set accuracy 0.999; k-fold accuracy 0.94 (± 0.00).

With PCA

Naive Bayes
GaussianNB: training-set accuracy 0.860119047619; k-fold accuracy 0.86 (± 0.01).
SVM
SVM: training-set accuracy 0.975452380952; k-fold accuracy 0.96 (± 0.00).
KNN
KNN: training-set accuracy 0.980761904762; k-fold accuracy 0.97 (± 0.00); test-set accuracy 0.97000.
Kmeans
Kmeans: training-set accuracy 0.0858095238095; k-fold score -318020.78 (± 1645.13). As before, this is KMeans.score (negative inertia), not an accuracy.
Decision Tree
DecisionTree: training-set accuracy 1.0; k-fold accuracy 0.81 (± 0.01).
Random Forest
Random Forest: training-set accuracy 0.999119047619; k-fold accuracy 0.88 (± 0.01); test-set accuracy 0.8800.

We can see that the accuracy from k-fold cross-validation is essentially the same as the accuracy on the test set.

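Algorithm analysis:

The digit labels are discrete rather than continuous, so this is a classification problem and regression methods should not be used. The characteristics of each algorithm are summarized below: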

Naive Bayes
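Naive Bayes is a supervised generative model. It handles multi-class problems, is cheap to compute, and relies on the attribute conditional independence assumption. For continuous attributes it uses a probability density function, here a Gaussian. There are three variants:
GaussianNB assumes that each feature follows a Gaussian distribution within each class, MultinomialNB assumes a multinomial distribution, and BernoulliNB assumes a Bernoulli distribution. Strictly speaking these are assumptions about the class-conditional feature distribution, not about the class prior.
In general, GaussianNB works better when most features are continuous, MultinomialNB is the better fit when most features are multi-valued discrete counts, and BernoulliNB should be used when the features are binary or very sparse multi-valued discrete values.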
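
To make the choice between the three variants concrete, here is a minimal sketch that cross-validates all of them on the same data. It uses scikit-learn's small built-in `load_digits` dataset (8x8 images, pixel values 0 to 16) as a stand-in for the Kaggle CSV files, and the `binarize=8.0` threshold for BernoulliNB is an assumed mid-range value, not a setting from the experiment above.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Stand-in data: 8x8 digit images with integer pixel values 0..16.
x, y = load_digits(return_X_y=True)

models = {
    "GaussianNB": GaussianNB(),              # treats pixels as continuous
    "MultinomialNB": MultinomialNB(),        # treats pixels as counts (needs x >= 0)
    "BernoulliNB": BernoulliNB(binarize=8.0),  # thresholds pixels to on/off (assumed cutoff)
}

# Mean 5-fold cross-validated accuracy for each variant.
scores = {name: cross_val_score(model, x, y, cv=5).mean()
          for name, model in models.items()}

for name, score in scores.items():
    print("%s: %.3f" % (name, score))
```

The numbers on this small dataset will differ from the Kaggle results; the point is only that the three variants can be compared under identical folds.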
SVM
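With small to medium sample sizes, SVM captures nonlinear relationships between the data and the features well. It avoids the architecture selection and local-minimum problems of neural networks, is fairly interpretable, and can handle high-dimensional problems.
SVM is sensitive to missing data, has no general-purpose solution for nonlinear problems, and choosing the right kernel function is not easy. Its computational cost is also high: mainstream algorithms have roughly O(n^2) complexity in the number of samples n, which is expensive for large datasets.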
KNN
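A supervised learning method: for a given sample, it finds the k training samples closest to it under a distance metric and predicts from their labels. It is accurate and insensitive to outliers, but computationally expensive.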
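
The nearest-neighbor rule is short enough to write out directly. The sketch below is a plain NumPy version with Euclidean distance and majority voting; the function name `knn_predict` and the toy points are illustrative assumptions, and scikit-learn's KNeighborsClassifier uses faster index structures than this brute-force loop.

```python
from collections import Counter

import numpy as np

def knn_predict(x_train, y_train, x_query, k=3):
    """Predict a label for each query point by majority vote among
    its k nearest training points (Euclidean distance)."""
    predictions = []
    for q in x_query:
        # Distance from the query to every training sample.
        dists = np.sqrt(((x_train - q) ** 2).sum(axis=1))
        nearest = np.argsort(dists)[:k]
        votes = Counter(y_train[nearest].tolist())
        predictions.append(votes.most_common(1)[0][0])
    return np.array(predictions)

# Two well-separated toy clusters.
x_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])
x_query = np.array([[0.05, 0.05], [5.0, 5.1]])

pred = knn_predict(x_train, y_train, x_query, k=3)
print(pred)  # [0 1]
```

Every query computes a distance to every training sample, which is why prediction cost grows with the size of the training set.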
Kmeans
An unsupervised learning method, mainly used for continuous attributes. It is easy to implement, but the choice of k matters a lot.
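
Because k-means never sees the labels, its cluster indices are arbitrary and cannot be compared with digit labels directly. One common way to score a clustering against known labels is to map each cluster to the majority label of its members first. The sketch below shows that mapping on hand-made arrays; the function name `cluster_accuracy` and the toy data are assumptions for illustration and are not part of the experiment above.

```python
import numpy as np

def cluster_accuracy(cluster_ids, y_true):
    """Map each cluster to the most frequent true label among its
    members, then return the accuracy of the mapped labels."""
    mapped = np.empty_like(y_true)
    for c in np.unique(cluster_ids):
        members = cluster_ids == c
        majority_label = np.bincount(y_true[members]).argmax()
        mapped[members] = majority_label
    return float((mapped == y_true).mean())

# Toy example: the clusters are nearly pure, but their ids do not match the labels.
cluster_ids = np.array([2, 2, 2, 0, 0, 1, 1, 1])
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 1])

raw_accuracy = float((cluster_ids == y_true).mean())
mapped_accuracy = cluster_accuracy(cluster_ids, y_true)
print(raw_accuracy, mapped_accuracy)  # 0.125 0.875
```

Comparing raw cluster ids with labels gives a near-chance number even when the clusters themselves are good, which is the situation a direct accuracy check on k-means output runs into.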
The Decision Tree
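A classification algorithm with low computational complexity. It can handle irrelevant features and works with both discrete and continuous attribute values, but it overfits easily.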
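
The overfitting tendency is easy to observe by comparing training accuracy with cross-validated accuracy. The following sketch does this for an unrestricted tree and a depth-limited one, using scikit-learn's built-in `load_digits` data as a stand-in for the Kaggle set; the depth limit of 5 and `random_state=0` are arbitrary assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

x, y = load_digits(return_X_y=True)

def train_and_cv_accuracy(max_depth):
    """Return (training accuracy, mean 5-fold CV accuracy) for one tree depth."""
    clf = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    cv_acc = cross_val_score(clf, x, y, cv=5).mean()
    train_acc = clf.fit(x, y).score(x, y)
    return train_acc, cv_acc

full_train, full_cv = train_and_cv_accuracy(None)    # grow until leaves are pure
shallow_train, shallow_cv = train_and_cv_accuracy(5)  # depth-limited tree

print("full tree:    train %.3f, cv %.3f" % (full_train, full_cv))
print("max_depth=5:  train %.3f, cv %.3f" % (shallow_train, shallow_cv))
```

A large gap between the two numbers for the unrestricted tree is the signature of overfitting; limiting depth narrows the gap, though not necessarily with a better cross-validated score.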
Random Forest
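An ensemble learning method built with decision trees as base learners; it introduces random attribute selection into the training of each tree. It is easy to implement and has low computational overhead.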

Results analysis

Algorithms without PCA dimensionality reduction

Algorithm	Training accuracy	k-fold cross-validation	Time
GaussianNB	0.5572	0.56	17.86 s
MultinomialNB	0.8248	0.82	2.06 s
BernoulliNB	0.8348	0.83	1.92 s
SVM	0.9405	0.93	15.45 min
KNN	0.9714	0.97	38.07 min
Kmeans	0.1483	-340919.55 (KMeans score, not accuracy)	1.88 min
DecisionTree	1.0	0.85	12.84 s
Random Forest	0.999	0.94	4.08 s

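The results show that k-means and GaussianNB perform very poorly. GaussianNB models every feature as a continuous Gaussian variable, which fits raw pixel intensities badly. k-means is an unsupervised method, so its cluster indices have no fixed correspondence to the digit labels and comparing them directly with the labels gives near-chance accuracy; KMeans() also defaults to 8 clusters, fewer than the 10 digit classes. In k-fold cross-validation KNN gives the best result, but it also takes the longest, because every prediction has to compute distances to the training samples in order to find the k nearest ones, which consumes a lot of compute.
The decision tree shows a large gap between its training accuracy and its k-fold accuracy. An unrestricted tree keeps splitting on the best attribute until it fits the training set perfectly, which is overfitting.
The best performers are random forest, SVM, and KNN.
Next I applied PCA dimensionality reduction to remove redundant features, looking only at accuracy.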

Algorithm	Training accuracy	k-fold cross-validation
GaussianNB	0.8601	0.86
SVM	0.9754	0.96
KNN	0.9808	0.97
DecisionTree	1.0	0.81
Random Forest	0.9991	0.88

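After PCA, the number of features is 154, down from the original 784. Compared with the results without PCA, GaussianNB and SVM both gain accuracy while KNN is essentially unchanged, because PCA removes redundant features and keeps the main ones. Random forest, however, loses accuracy, as does the decision tree. Its base learner, the decision tree, makes decisions attribute by attribute, and fewer attributes remain after the reduction. A random forest draws a random subset of k attributes at each node and then picks the best split attribute from that subset, so with fewer attributes the randomness also shrinks.
A likely reason for GaussianNB's gain is that PCA decorrelates the features, which fits the conditional independence assumption better, and the projected components are continuous and closer to Gaussian than raw pixel intensities. Note that the PCA call in the code above does not set whiten=True, so the components are not rescaled to unit variance.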
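
Passing a fraction such as `n_components=0.95` asks PCA to keep as many components as are needed to reach that share of the total variance. The sketch below reproduces that selection rule with NumPy's SVD on synthetic data; the function name `components_for_variance` and the random data are assumptions for illustration, and tie-breaking at exact equality may differ from scikit-learn's implementation.

```python
import numpy as np

def components_for_variance(x, target=0.95):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `target`."""
    centered = x - x.mean(axis=0)
    # Squared singular values are proportional to the component variances.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    ratio = singular_values ** 2 / np.sum(singular_values ** 2)
    cumulative = np.cumsum(ratio)
    n_components = int(np.searchsorted(cumulative, target) + 1)
    return n_components, cumulative

# Synthetic data: 20 observed features driven by 3 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 20))
x = latent @ mixing + 0.05 * rng.normal(size=(500, 20))

n_components, cumulative = components_for_variance(x, target=0.95)
print(n_components, cumulative[n_components - 1])
```

When most features are redundant, as with the many near-constant border pixels of digit images, only a small fraction of the original dimensions is needed to reach the target.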
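To get a better result, I then tuned the SVM for the best accuracy. By adjusting the number of PCA dimensions so that 46 features remain, I obtained an accuracy of 0.98242.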
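
The tuning code is not shown, so the following is only a sketch of one way to search over the PCA dimension in front of an SVM, not the procedure actually used. It assumes scikit-learn's `Pipeline` and `GridSearchCV`, uses the built-in `load_digits` data (64 features) as a stand-in, and the candidate values 16, 32, and 46 are arbitrary apart from 46 being the dimension mentioned above.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

x, y = load_digits(return_X_y=True)
x = x / 16.0  # scale pixel values to [0, 1], as the article does with /255

# PCA is refit inside every fold, so the validation folds never leak into it.
pipe = Pipeline([("pca", PCA()), ("svc", SVC())])
param_grid = {"pca__n_components": [16, 32, 46]}  # assumed candidate values

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(x, y)

print(search.best_params_, search.best_score_)
```

Putting PCA inside the pipeline rather than transforming the data once up front keeps the cross-validated score an honest estimate; the same grid could be extended with SVC parameters such as `svc__C`.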
