Kaggle Digit Recognizer: A Comparison of Different Algorithms

The dataset comes from the Kaggle competition Digit Recognizer.
I applied Naive Bayes, KNN, k-means, SVM, Decision Tree, and Random Forest,
and compared the accuracy of each algorithm with and without PCA dimensionality reduction.
The experiment proceeds as follows:

  • Data loading
  • Data preprocessing
  • Model selection
  • Model optimization

Without PCA

(The code below is the PCA version; comment out the get_PCA call to run the version without PCA.)

import os
import numpy as np
import pandas as pd
import time
import matplotlib.pyplot as plt
from sklearn.feature_selection import VarianceThreshold

size_img = 28              # the images are 28x28 pixels
threshold_color = 100/255  # grey-level threshold used by the cropping helpers

data_train = pd.read_csv('train.csv')

y_train = np.array(data_train.iloc[:, 0])   # first column: digit label
x_train = np.array(data_train.iloc[:, 1:])  # remaining columns: 784 pixel values


data_test = pd.read_csv('test.csv')
x_test = np.array(data_test)

n_features_train = x_train.shape[1]
n_samples_train = x_train.shape[0]
n_features_test = x_test.shape[1]
n_samples_test = x_test.shape[0]
print(n_features_train,n_samples_train,n_features_test,n_samples_test)
print(x_train.shape, y_train.shape, x_test.shape)

"""
show the image 
"""
def show_img(x):
    plt.figure(figsize= (8,7))
    if x.shape[0] > 100:
        print(x.shape[0])
        n_imgs = 16
        n_samples = x.shape[0]
        x = x.reshape(n_samples,size_img,size_img)
        for i in range(n_imgs):
            plt.subplot(4,4,i+1)
            plt.imshow(x[i])
        plt.show()
    else:
        plt.imshow(x)
        plt.show()

# Normalize pixel values from 0-255 integers to floats in [0, 1]
def int2float_grey(x):
    x = x / 255
    return x

# Cropping helpers: locate the digit's bounding box in each flattened image
def find_left_edge(x):
    edge_left = []
    n_samples = x.shape[0]
    for k in range(n_samples):
        for j in range(size_img):
            for i in range(size_img):
                if x[k,size_img*i +j] >= threshold_color:
                    edge_left.append(j)
                    break
            if len(edge_left) >k :
                break
    return edge_left

def find_right_edge(x):
    edge_right = []
    n_samples = x.shape[0]
    for k in range(n_samples):
        for j in range(size_img):
            for i in range(size_img):
                if x[k,size_img*i +(size_img -j-1)] >= threshold_color:
                    edge_right.append(size_img - 1- j)
                    break
            if len(edge_right) >k :
                break
    return edge_right


def find_top_edge(x):
    edge_top = []
    n_samples = x.shape[0]
    for k in range(n_samples):
        for j in range(size_img):
            for i in range(size_img):
                if x[k,size_img*i +j] >= threshold_color:
                    edge_top.append(i)
                    break
            if len(edge_top) >k :
                break
    return edge_top

def find_bottom_edge(x):
    edge_bottom = []
    n_samples = x.shape[0]
    for k in range(n_samples):
        for j in range(size_img):
            for i in range(size_img):
                if x[k,size_img*(size_img-1 -i) +j] >= threshold_color:
                    edge_bottom.append(size_img-1 -i)
                    break
            if len(edge_bottom) >k :
                break
    return edge_bottom


from skimage import transform

# Crop each digit to its bounding box and stretch it back to 28x28
def stretch_image(x):
    edge_left = find_left_edge(x)
    edge_right = find_right_edge(x)
    edge_top = find_top_edge(x)
    edge_bottom = find_bottom_edge(x)
    n_samples = x.shape[0]
    x = x.reshape(n_samples, size_img, size_img)

    for i in range(n_samples):
        x[i] = transform.resize(
            x[i][edge_top[i]:edge_bottom[i] + 1, edge_left[i]:edge_right[i] + 1],
            (size_img, size_img))
    x = x.reshape(n_samples, size_img ** 2)
    show_img(x)
    return x  # return the cropped-and-stretched, flattened images

# Feature selection: drop constant (zero-variance) pixel columns
def get_threshold(x_train, x_test):
    selector = VarianceThreshold(threshold= 0).fit(x_train)
    x_train = selector.transform(x_train)
    x_test = selector.transform(x_test)
    print("x_train.shape:",x_train.shape)
    print("x_test.shape:", x_test.shape)
    return x_train, x_test


from sklearn.decomposition import PCA

# PCA: keep enough components to explain 95% of the variance
def get_PCA(x_train, x_test):
    pca = PCA(n_components=0.95)
    pca.fit(x_train)
    x_train = pca.transform(x_train)
    x_test = pca.transform(x_test)
    return x_train, x_test


x_train = int2float_grey(x_train)
x_test = int2float_grey(x_test)
# x_train = stretch_image(x_train)   # optional cropping/stretching step
# x_test = stretch_image(x_test)
# x_train, x_test = get_threshold(x_train, x_test)   # optional variance-based feature selection
x_train, x_test = get_PCA(x_train, x_test)   # comment this line out for the no-PCA runs


# Train the named model, evaluate it, and write its test-set predictions to a CSV
def general_function(mod_name, model_name):
    y_pred = model_train_predict(mod_name, model_name)
    out_predict(y_pred,model_name)


from sklearn.model_selection import cross_val_score

def model_train_predict(mod_name, model_name):
    start_time = time.time()
    # import the model class dynamically, e.g. sklearn.naive_bayes.GaussianNB
    import_mod = __import__(mod_name, fromlist=[model_name])
    if hasattr(import_mod, model_name):
        f = getattr(import_mod, model_name)
    else:
        print("model %s not found in %s" % (model_name, mod_name))
        return []

    clf = f()
    clf.fit(x_train, y_train)
    y_predict = clf.predict(x_train)
    print("mod_name.model_name is:", mod_name, model_name)
    end_time = time.time()
    print("train's time is:", end_time - start_time)
    get_acc(y_predict, y_train)
    scores = cross_val_score(clf, x_train, y_train, cv=5)
    print("Accuracy: %0.2f (+- %0.2f)" % (scores.mean(), scores.std() * 2))

    y_predict = clf.predict(x_test)
    return y_predict

# Accuracy on the training set (correct predictions / number of training samples)
def get_acc(y_pred, y_train):
    right_num = (y_train == y_pred).sum()
    print("acc:", right_num / n_samples_train)

# Write the predictions in the Kaggle submission format (ImageId, Label)
def out_predict(y_pred, model_name):
    print(y_pred)
    data_pred = {'ImageId':range(1,n_samples_test +1),'Label': y_pred}
    data_pred = pd.DataFrame(data_pred)
    data_pred.to_csv("Pre_%s.csv"%model_name,index=False)


from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
mod_name = "sklearn.naive_bayes"
model_name = "GaussianNB"
general_function(mod_name, model_name)

# model_name = "MultinomialNB"  # not used with PCA, because MultinomialNB's input must be non-negative
# general_function(mod_name, model_name)
#
# model_name = "BernoulliNB"
# general_function(mod_name, model_name)

from sklearn.svm import SVC
mod_name = "sklearn.svm"
model_name = "SVC"
general_function(mod_name, model_name)

from sklearn.neighbors import KNeighborsClassifier
mod_name = "sklearn.neighbors"
model_name = "KNeighborsClassifier"
general_function(mod_name, model_name)

from sklearn.cluster import KMeans
mod_name = "sklearn.cluster"
model_name = "KMeans"
general_function(mod_name, model_name)

from sklearn.tree import DecisionTreeClassifier
mod_name = "sklearn.tree"
model_name = "DecisionTreeClassifier"
general_function(mod_name, model_name)

from sklearn.ensemble import RandomForestClassifier
mod_name = "sklearn.ensemble"
model_name = "RandomForestClassifier"
general_function(mod_name, model_name)

Results without PCA (only some models were submitted to the test set):

  • Naive Bayes:
    GaussianNB: training-set accuracy 0.55719047619, training time 17.86 s; k-fold accuracy 0.56 (± 0.01).
    MultinomialNB: training-set accuracy 0.82480952381, training time 2.06 s; k-fold accuracy 0.82 (± 0.01).
    BernoulliNB: training-set accuracy 0.834785714286, training time 1.92 s; k-fold accuracy 0.83 (± 0.01); test-set score 0.83242.
  • SVM
    Training-set accuracy 0.9405, training time 15.45 min; k-fold accuracy 0.93 (± 0.01); test-set score 0.93600.
  • KNN
    Training-set accuracy 0.97142857143, training time 38.07 min; k-fold accuracy 0.97 (± 0.00); test-set score 0.97000.
  • K-means
    Training-set accuracy 0.14830952381, training time 1.877 min; k-fold score -340919.55 (± 1785.45), which is KMeans's default score (the negative inertia), not an accuracy; test-set score 0.15028 (see the sketch after this list for how cluster IDs can be mapped to digit labels).
  • Decision Tree
    Training-set accuracy 1.0, training time 12.84 s; k-fold accuracy 0.85 (± 0.00).
  • Random Forest
    Training-set accuracy 0.999; k-fold accuracy 0.94 (± 0.00).
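
KMeans.predict returns arbitrary cluster IDs (0 to 9) with no fixed correspondence to the digit labels, which is one reason its "accuracy" above is so low. As a side note, here is a minimal sketch, not part of the original experiment, of how each cluster could be mapped to the majority training label before scoring; kmeans_mapped_accuracy is a hypothetical helper and it reuses the x_train / y_train arrays loaded above:

import numpy as np
from sklearn.cluster import KMeans

def kmeans_mapped_accuracy(x, y, n_clusters=10):
    # fit KMeans and obtain a cluster ID for every training sample
    km = KMeans(n_clusters=n_clusters, random_state=0)
    cluster_ids = km.fit_predict(x)
    # map each cluster to the most common digit label among its members
    mapping = {c: np.bincount(y[cluster_ids == c]).argmax()
               for c in range(n_clusters)}
    y_pred = np.array([mapping[c] for c in cluster_ids])
    return (y_pred == y).mean()

print("mapped KMeans accuracy:", kmeans_mapped_accuracy(x_train, y_train))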

With PCA

  • Naive Bayes:
    GaussianNB: training-set accuracy 0.860119047619; k-fold accuracy 0.86 (± 0.01).

  • SVM
    Training-set accuracy 0.975452380952; k-fold accuracy 0.96 (± 0.00).

  • KNN
    Training-set accuracy 0.980761904762; k-fold accuracy 0.97 (± 0.00); test-set score 0.97000.

  • K-means
    Training-set accuracy 0.0858095238095; k-fold score -318020.78 (± 1645.13), again the negative inertia rather than an accuracy.

  • Decision Tree
    Training-set accuracy 1.0; k-fold accuracy 0.81 (± 0.01).

  • Random Forest
    Training-set accuracy 0.999119047619; k-fold accuracy 0.88 (± 0.01); test-set score 0.8800.

As the results show, the k-fold cross-validation accuracy is essentially consistent with the accuracy obtained on the test set.

Algorithm analysis:

Digit recognition is a classification problem with discrete class labels, not a continuous regression target, so regression methods are not appropriate here. The main characteristics of each algorithm are summarized below:

  • Naive Bayes
    Naive Bayes is a generative model for supervised learning. It handles multi-class problems, is computationally cheap, and rests on the assumption that attributes are conditionally independent given the class; continuous attributes are modelled with a probability density function, here a Gaussian. There are three variants: GaussianNB models each feature with a Gaussian distribution, MultinomialNB with a multinomial distribution, and BernoulliNB with a Bernoulli distribution. In general, if most features are continuous values, GaussianNB works best; if most features are multi-valued discrete counts, MultinomialNB is more suitable; and if the features are binary or very sparse discrete values, BernoulliNB should be used (a short sketch comparing the three variants follows this list).

  • SVM
    With small to medium sample sizes, SVM can capture non-linear relationships between the data and the features, avoids the architecture-selection and local-minimum problems of neural networks, is fairly interpretable, and copes well with high-dimensional problems. However, it is sensitive to missing data, there is no universal recipe for non-linear problems, choosing the right kernel is not easy, and the computational cost is high: mainstream solvers scale roughly as O(n²), which is very expensive on large datasets.

  • KNN
    A supervised method: for a given sample, it finds the k nearest training samples under a distance metric. It is accurate and insensitive to outliers, but computationally expensive at prediction time.

  • K-means
    An unsupervised method, mainly suited to continuous attributes. It is easy to implement, but the choice of k is critical.

  • Decision Tree
    A classification algorithm with modest computational cost. It can cope with irrelevant features, generalizes reasonably well, and handles both discrete and continuous attribute values, but it overfits easily.

  • Random Forest
    An ensemble method built from decision trees as base learners, with random attribute selection introduced into the tree-training process. It is easy to implement and has low computational overhead.
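
To make the difference between the three Naive Bayes variants concrete, here is a minimal sketch (not from the original run) that cross-validates all three on the normalized pixel data; it reuses the x_train / y_train arrays loaded above and assumes the PCA step has been commented out, since MultinomialNB requires non-negative input:

from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.model_selection import cross_val_score

# compare the three variants on the same non-negative, [0, 1]-scaled features
for nb in (GaussianNB(), MultinomialNB(), BernoulliNB()):
    scores = cross_val_score(nb, x_train, y_train, cv=5)
    print(type(nb).__name__, "accuracy: %0.2f (+- %0.2f)" % (scores.mean(), scores.std() * 2))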

Result analysis

Without PCA dimensionality reduction

Algorithm      | Training-set accuracy | k-fold CV accuracy | Time
GaussianNB     | 0.5572                | 0.56               | 17.86 s
MultinomialNB  | 0.8248                | 0.82               | 2.06 s
BernoulliNB    | 0.8348                | 0.83               | 1.92 s
SVM            | 0.9405                | 0.93               | 15.45 min
KNN            | 0.9714                | 0.97               | 38.07 min
Kmeans         | 0.1483                | -340919.55         | 1.88 min
DecisionTree   | 1.0                   | 0.85               | 12.84 s
Random Forest  | 0.999                 | 0.94               | 4.08 s

From these results, K-means and GaussianNB perform very poorly, since both are better suited to continuous attributes. In the k-fold cross-validation KNN performs best, but it is also by far the slowest, because every prediction has to compute distances to the training samples, which costs a great deal of compute.
The decision tree's training-set accuracy and its cross-validation accuracy differ greatly: the tree keeps splitting on the locally optimal attribute until it fits the training data, which leads to overfitting.
Random Forest, SVM, and KNN perform well.
Next, we apply PCA dimensionality reduction to remove redundant features, considering only accuracy.

Algorithm      | Training-set accuracy | k-fold CV accuracy
GaussianNB     | 0.8601                | 0.86
SVM            | 0.9754                | 0.96
KNN            | 0.9808                | 0.97
DecisionTree   | 1.0                   | 0.81
Random Forest  | 0.9991                | 0.88

After PCA, the feature count drops from the original 784 to 154. Compared with the no-PCA results, GaussianNB and SVM both gain accuracy and KNN is essentially unchanged, because PCA removes redundant features while keeping the principal ones. Random Forest loses accuracy, however, for the same reason the decision tree does: its base learners are decision trees that split on attributes, so with fewer attributes available the accuracy drops. Moreover, a random forest selects at each node a random subset of k attributes and then picks the best split within that subset; with fewer attributes to draw from, this randomness is also reduced.
GaussianNB improves because PCA projects the samples onto decorrelated components, which matches the Naive Bayes independence assumption and the per-feature Gaussian model better than the raw pixels do.
Finally, to get a better result, I tuned the SVM for the best accuracy: by adjusting the PCA dimensionality so that only 46 features remain, the score reaches 0.98242 (a sketch of such a tuning setup follows).
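
The post does not include the tuning code itself. Below is a minimal sketch of how such an experiment could be set up, assuming a PCA-then-SVC pipeline and a small grid search; the parameter values other than n_components=46 are placeholders rather than the author's actual settings, and x_train / x_test here must be the pixel data before the earlier get_PCA call:

from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# search over the PCA dimensionality and the SVC hyper-parameters together
pipe = Pipeline([("pca", PCA()), ("svc", SVC())])
param_grid = {
    "pca__n_components": [46, 100, 154],  # 46 matches the dimensionality reported above
    "svc__C": [1, 10],                    # placeholder values
    "svc__gamma": ["scale", 0.01],        # placeholder values
}
search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1)
search.fit(x_train, y_train)
print(search.best_params_, search.best_score_)

y_pred = search.best_estimator_.predict(x_test)
out_predict(y_pred, "SVC_tuned")          # reuse the submission helper defined earlier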
