A Hand-Written Naive Bayes for Text Classification

Latest recommended article published on 2022-02-04 02:17:25

mini猿要成长QAQ


Category column: Text Classification

Article link: https://blog.csdn.net/sgfmby1994/article/details/53437063


Python's sklearn package already provides Naive Bayes functions, but to understand Naive Bayes better I chose to write my own version.

Notes:

This Naive Bayes is meant for text classification, so first a few words about the program's inputs and outputs.

skf is produced by cross-validation:

from sklearn.cross_validation import StratifiedKFold

sklearn provides this cross-validation method.
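In current scikit-learn releases the `sklearn.cross_validation` module is gone; `StratifiedKFold` is imported from `sklearn.model_selection`, and the folds come from its `split` method rather than from iterating over the object itself. Below is a minimal sketch of that newer usage. The toy `X` and `y` are made up for illustration and are not the data used in this post.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (hypothetical): 8 documents, 3 binary features, 2 classes.
X = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1],
              [1, 0, 0], [0, 1, 1], [1, 1, 1], [0, 0, 0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

skf = StratifiedKFold(n_splits=2)
folds = list(skf.split(X, y))  # list of (train_index, test_index) pairs

for train_index, test_index in folds:
    # Each test fold keeps the class proportions of y.
    print(len(train_index), len(test_index), np.bincount(y[test_index]))
```

With two folds, each half serves once as the training set and once as the test set, which matches the 100,000 / 100,000 split described later in the post.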

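Why np.memmap? My computer runs a 32-bit operating system, so I could only install 32-bit Python, and it ran out of memory when processing large amounts of data. Painful. Other people told me they never hit this problem anywhere in the project, whereas I seem to have spent a great deal of time fighting MemoryError: the sparse matrix sparse.coo_matrix, the memory-mapped array np.memmap, pandas and so on were all brought in to get around it. Frustrating... Still, this problem is how I came to understand and learn these more advanced data-handling tools, so something was gained along with what was lost. For how to use these two structures, see the blog posts by more experienced people, which explain them plainly and clearly.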

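The following passage should help in understanding what np.memmap does:
A memory-mapped file is a way of treating a very large binary data file on disk as if it were an array in memory. NumPy implements a memmap object that resembles an ndarray and allows a large file to be read and written in small segments, rather than loading the whole array into memory at once. A memmap also has the same methods as an ordinary array, so essentially any algorithm that works on an ndarray also works on a memmap. To create a new memmap, call np.memmap with a file path, a dtype, a shape and a file mode.

For large data, use np.memmap for memory mapping.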
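To make the description concrete, here is a small sketch of creating a memmap, filling it from sparse (row, column, value) triplets, and reopening it. The file name, the shape and the triplets are hypothetical; only the `np.memmap` call with a path, dtype, shape and mode follows the text above.

```python
import os
import tempfile
import numpy as np

# Hypothetical sparse matrix in coordinate form: (row, col, value) triplets.
rows = np.array([0, 0, 2, 3])
cols = np.array([1, 4, 0, 2])
vals = np.array([1, 1, 1, 1], dtype=np.int16)
shape = (4, 5)

path = os.path.join(tempfile.mkdtemp(), "train.dat")  # made-up file name

# mode='w+' creates (or overwrites) the file; a new file starts out as zeros.
dense = np.memmap(path, mode="w+", dtype=np.int16, shape=shape)
dense[rows, cols] = vals  # one indexed assignment instead of a Python loop
dense.flush()             # write pending changes to disk
del dense

# dtype and shape are not stored in the file, so they are passed again.
reopened = np.memmap(path, mode="r", dtype=np.int16, shape=shape)
total = int(reopened.sum())
print(reopened[0], total)
```

Only the parts of the file that are actually touched get paged into memory, which is what makes this usable on a machine with a small address space.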

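After that, most of the time went into optimizing speed. Originally a run over 200,000 documents took 12 minutes (100,000 for training, 100,000 for testing). That was too long. I felt I had written too many for loops and the time complexity was too high, especially in the test phase: for each of the 100,000 articles, looping over every feature word, looking up the probability that the word equals 1 under each class, and multiplying everything together one factor at a time to get the probability of each class simply takes too long. Then I discovered that np.array is extremely handy.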
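The kind of speed-up that arrays give in the test phase can be sketched as follows. This is not the code from this post: the two-class model, the probabilities and the function names are invented for illustration. It compares an explicit triple loop with the same Bernoulli Naive Bayes log-score written as two matrix products.

```python
import numpy as np

# Hypothetical model: 2 classes, 3 binary features.
prior = np.array([0.5, 0.5])       # P(class)
p = np.array([[0.8, 0.1, 0.5],     # P(feature_j = 1 | class 0)
              [0.2, 0.7, 0.5]])    # P(feature_j = 1 | class 1)

# Hypothetical test documents, one row per document.
X_test = np.array([[1, 0, 1],
                   [0, 1, 0]])

def scores_loop(X, prior, p):
    """Reference version: loops over documents, classes and features."""
    out = np.zeros((X.shape[0], p.shape[0]))
    for d in range(X.shape[0]):
        for c in range(p.shape[0]):
            s = np.log(prior[c])
            for j in range(X.shape[1]):
                s += np.log(p[c, j]) if X[d, j] == 1 else np.log(1 - p[c, j])
            out[d, c] = s
    return out

def scores_vectorized(X, prior, p):
    """Same log-scores using two matrix products and no Python loops."""
    return X @ np.log(p).T + (1 - X) @ np.log(1 - p).T + np.log(prior)

pred = scores_vectorized(X_test, prior, p).argmax(axis=1)
print(pred)
```

Both functions return one row of class scores per document; the vectorized one pushes the inner loops into NumPy, which is where most of the time is saved.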

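Laplace smoothing is also used here (don't be put off by the fancy name: it just adds 1 to the numerator and 2 to the denominator; I first tried the m-estimate but found I couldn't follow it...).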
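As a small illustration of the "add 1 on top, 2 on the bottom" rule, assuming Python 3 division and invented counts (the function name is hypothetical too): because each feature is binary, there are two possible values, hence the 2 in the denominator.

```python
def smoothed_probability(count_feature_1, count_class_docs):
    """Estimate P(feature = 1 | class) with add-one (Laplace) smoothing.

    count_feature_1: documents of the class in which the feature equals 1
    count_class_docs: all documents of the class
    """
    return (count_feature_1 + 1) / (count_class_docs + 2)

# Hypothetical counts: the word appears in 0, 30 or all 100 documents of a class.
never = smoothed_probability(0, 100)     # 1/102 instead of 0
some = smoothed_probability(30, 100)     # 31/102
always = smoothed_probability(100, 100)  # 101/102 instead of 1
print(never, some, always)
```

The estimate is never exactly 0 or 1, so neither log(p) nor log(1 - p) can blow up later.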

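Rather than multiplying the very small probabilities together and then taking the log, I take the log of each probability and add the results (log(a*b*c) = log a + log b + log c). This avoids multiplying many tiny probabilities, whose product becomes so small that the computer treats it as 0 and the classes can no longer be compared.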
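The underflow is easy to reproduce. In the sketch below the probabilities and the number of features are invented for illustration; the point is only that the direct products collapse to zero while the sums of logs stay comparable.

```python
import math

def product_of(probs):
    """Multiply the probabilities directly."""
    result = 1.0
    for q in probs:
        result *= q
    return result

def log_sum_of(probs):
    """Add the logarithms instead."""
    return sum(math.log(q) for q in probs)

# Hypothetical per-feature probabilities for two classes, 400 features each.
class_a = [0.1] * 400
class_b = [0.05] * 400

prod_a, prod_b = product_of(class_a), product_of(class_b)
log_a, log_b = log_sum_of(class_a), log_sum_of(class_b)

print(prod_a, prod_b)  # both products underflow to 0.0, so they cannot be ranked
print(log_a > log_b)   # the log-sums are ordinary floats and still compare
```

Since log is monotonic, the class with the largest log-sum is also the class with the largest product, so nothing is lost by comparing in log space.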

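The program follows. It also contains code for the confusion matrix and for plotting the ROC curve, both adapted from code collected from other people.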

for train_index, test_index in skf:
    print("Training samples in this fold: " + str(len(train_index)))
    print("Test samples in this fold: " + str(len(test_index)))
    X_train,X_test = sparse.coo_matrix(X[train_index]),sparse.coo_matrix(X[test_index])
    y_train, y_test = y[train_index], y[test_index]
    # print type(y_train)  #<type 'numpy.ndarray'>


    print('*************************\nNaive_bayes')
    # clf_bayes = MultinomialNB(alpha=0.01)
    # clf_bayes.fit(X_train, y_train)
    # y_pre_bayes[test_index] = clf_bayes.predict(X_test)
    # Naive Bayes classification
    print("Training started")
    bayes_train_starttime = time.time()
    y_train_list = y_train.tolist()  #numpy.ndarray to list
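    # One memmap file per fold, so the dense training matrix lives on disk rather than in RAM
    if (times == 0) :
        X_train_bayes = np.memmap('mymmap2', mode='w+', dtype=np.int16, shape=X_train.shape)
    if (times == 1) :
        X_train_bayes = np.memmap('mymmap3', mode='w+', dtype=np.int16, shape=X_train.shape)

    # Copy the non-zero entries of the sparse matrix into the memmap
    for i, j, v in zip(X_train.row, X_train.col, X_train.data):
        X_train_bayes[i, j] = v
    class_amount = []  # number of training documents in each class
    p_class_amount = []  # fraction of the training documents that falls in each class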

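    # Create one pair of lists per class. This works at module level, where locals()
    # is the module namespace; a dict keyed by class would be the more robust choice.
    for category_num in range(0, 10):
        (locals()['feature_list_class' + str(category_num + 1) + '_1']) = []
        (locals()['p_feature_list_class' + str(category_num + 1) + '_1']) = []
    # This yields the initially empty lists feature_list_class1_1, ... and p_feature_list_class1_1, ...
    # The former stores, for class i, how many documents have each feature equal to 1;
    # the latter stores that count as a fraction of the documents in the class.
    for category_num in range(0, 10):
        class_amount.append(y_train_list.count(category_num))
        p_class_amount.append(("%.3f" % (float(y_train_list.count(category_num))/len(y_train_list))))  # keep 3 decimal places
    index = 0  # first row of the current class's range (the slicing below needs rows sorted by class)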
    for category_num in range(0, 10):
        print(category_num)
        for feature_num in range(0,X.shape[1]):
            # print list(np.where(y_train == category_num))
            # i1 = 0
            # def myfind(x, y):
            #     return [a for a in range(len(y)) if y[a] == x]
            # all_index = myfind(category_num,y_train.tolist())
            # for index in all_index:
            #     if X_train_bayes[index, feature_num] == 1:
            #         i1 += 1

            i1 = (X_train_bayes[index:index+class_amount[category_num], feature_num]).tolist().count(1)
            # np.column_stack(((locals()['feature_list_class' + str(category_num+1) + '_1']), i1))
            # i11=("%.3f" % (float(i1+1)/(class_amount[category_num]+2)))
            # np.column_stack(((locals()['p_feature_list_class' + str(category_num+1) + '_1']), i11))
            (locals()['feature_list_class' + str(category_num+1) + '_1']).append(i1)
            (locals()['p_feature_list_class' + str(category_num+1) + '_1']).append(("%.10f" % (float(i1+1)/(class_amount[category_num]+2))))
        index += class_amount[category_num]

    bayes_train_endtime = time.time()
    bayes_train_runtime = bayes_train_endtime - bayes_train_starttime
    print('Naive Bayes training time: %s Seconds' % (bayes_train_endtime - bayes_train_starttime))
    # Testing starts here
    print("Testing started")
    from functools import reduce
    bayes_test_starttime = time.time()
    testnum = 1

    for test_num in test_index:
        print(testnum)
        testnum += 1
        belong_classn = []  # probability of belonging to each of the 10 classes
        for category_num in range(0, 10):
            belong_classn.append(p_class_amount[category_num])  # initialise with P(class)
        indexs_1 = np.array(np.where(X[test_num, :] == 1))
        indexs_0 = np.array(np.where(X[test_num, :] == 0))

        for category_num in range(0,10):
            locals()['p_feature_list_class' + str(category_num + 1) + '_1'] = np.array(locals()['p_feature_list_class' + str(category_num + 1) + '_1'])
            if indexs_1.shape[1] == 0:
                b_1 = 0
            # elif indexs_1.shape[1] == 1:
            #     b_1 = math.log(float(((locals()['p_feature_list_class' + str(category_num + 1) + '_1'])[indexs_1])[0,0]))
            else:
                b_1 =  (locals()['p_feature_list_class' + str(category_num + 1) + '_1'][indexs_1])[0]
                b_1 = sum([math.log(float(i)) for i in b_1])
                # b_1 =  reduce(lambda x, y: math.log(float(x)) + math.log(float(y)), b_1[0],math.log(float(belong_classn[category_num])))
            if indexs_0.shape[1] == 0:
                b_0 = 0
            # elif indexs_0.shape[1] == 1:
            #     b_0 = math.log(1-float(((locals()['p_feature_list_class' + str(category_num + 1) + '_1'][indexs_0])[0,0])))
            else:
                b_0 = (locals()['p_feature_list_class' + str(category_num + 1) + '_1'][indexs_0])[0]
                b_0 = sum([math.log(1-float(i)) for i in b_0])
                # b_0 = reduce(lambda x, y: float(x) * float(y),(locals()['p_feature_list_class' + str(category_num + 1) + '_1'][indexs_0]))

            belong_classn[category_num] =  math.log(float(belong_classn[category_num])) + b_1 + b_0
        #     print belong_classn[category_num]
        #     for ii1 in indexs_1:
        #         for i1 in ii1:
        #             equal_1_value = (locals()['p_feature_list_class' + str(category_num + 1) + '_1'])[i1]
        #             belong_classn[category_num] = float(belong_classn[category_num]) * float(equal_1_value)
        #     for ii0 in indexs_0:
        #         for i0 in ii0:
        #             equal_0_value = (locals()['p_feature_list_class' + str(category_num + 1) + '_1'])[i0]
        #             belong_classn[category_num] = float(belong_classn[category_num]) * (1-float(equal_0_value))
        #     # print belong_classn[category_num]
        #     belong_classn[category_num] = math.log(float(belong_classn[category_num]) + (1e-100))
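        #     # This sometimes raised an error: one of the factors was 0, which ended in log(0).
        #     # So the probabilities need smoothing beforehand to avoid zero probabilities.
        #     # This gives the probability of the article belonging to each class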

        y_pre_bayes[test_num] = belong_classn.index(max(belong_classn))  # the class with the largest score is the prediction
    bayes_test_endtime = time.time()
    bayes_test_runtime = bayes_test_endtime - bayes_test_starttime
    print('Naive Bayes test time: %s Seconds' % (bayes_test_endtime - bayes_test_starttime))

    # # Compute the ROC curve and the area under the curve
    # # roc_curve() returns fpr, tpr and the thresholds
    # bayes_fpr, bayes_tpr, bayes_thresholds = roc_curve(y[test_index], y_pre_bayes[test_index])
    # bayes_mean_tpr += interp(bayes_mean_fpr, bayes_fpr, bayes_tpr)  # interpolate mean_tpr at mean_fpr, using interp() from the scipy package
    # bayes_mean_tpr[0] = 0.0  # starts at 0
    # bayes_roc_auc = auc(bayes_fpr, bayes_tpr)
    # # For the plot, plt.plot(fpr, tpr) is enough; roc_auc only stores the AUC value, which auc() computes
    # plt.plot(bayes_fpr, bayes_tpr, ':',lw=2, label='Naive_bayes ROC fold %d (area = %0.2f)' % (times + 1, bayes_roc_auc))

    bayes_accuracy_rate[times] = np.mean(y_pre_bayes[test_index] == y[test_index])
    print('Naive Bayes accuracy in fold %d: %.6f' % ((times+1), bayes_accuracy_rate[times]))
    # times += 1
    # print 'precision, recall, F1-score: >>>>>>>>'
    # print (classification_report(y[test_index], y_pre_bayes[test_index]))
    # print 'Confusion matrix: >>>>>>'
    # cm = confusion_matrix(y[test_index], y_pre_bayes[test_index])
    # plt.figure()
    # plot_confusion_matrix(cm)
    # plt.show()
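The training loop above counts, one feature at a time, how many documents of a class have that feature equal to 1, and it slices the rows of each class by position. A possible array-based alternative is sketched below; it is not the code of this post. The function name and toy data are hypothetical, it assumes Python 3 with `X` as a 0/1 NumPy array and `y` as integer labels, and it selects rows with a boolean mask, so the rows do not have to be sorted by class.

```python
import numpy as np

def fit_bernoulli_nb(X, y, n_classes):
    """Return (log prior, P(feature = 1 | class)) with add-one smoothing.

    Assumes every class occurs at least once in y.
    """
    n_docs, n_features = X.shape
    class_count = np.bincount(y, minlength=n_classes)  # documents per class
    feature_count = np.zeros((n_classes, n_features))
    for c in range(n_classes):
        # The mask picks the rows of class c wherever they sit in X.
        feature_count[c] = (X[y == c] == 1).sum(axis=0)
    p = (feature_count + 1) / (class_count[:, None] + 2)
    log_prior = np.log(class_count / n_docs)
    return log_prior, p

# Toy data (hypothetical): 4 documents, 2 features, classes deliberately interleaved.
X = np.array([[0, 1],
              [1, 0],
              [0, 1],
              [1, 1]])
y = np.array([1, 0, 1, 0])

log_prior, p = fit_bernoulli_nb(X, y, n_classes=2)
print(np.exp(log_prior))
print(p)
```

One loop over the classes remains, but the loop over all features is replaced by a single column-wise sum per class. Note that `X[y == c]` copies the selected rows, so for a very large memmap it may be necessary to process the rows in chunks.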