CNN Chinese Spam Email Classification (Part 2)

Latest recommended article published on 2024-03-20 18:44:03

Author: 空字符 (WeChat official account: 月来客栈)

Category: TensorFlow framework

Article link: https://blog.csdn.net/The_lastest/article/details/81746887

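Summary: Based on Tang Yudi's tutorial, this post describes how to classify Chinese spam email with a CNN. First, the segmented text is used to build a vocabulary and train word vectors. Next, padding brings all samples to the same length, and each email is represented by its word vectors. Finally, the post compares using pretrained word vectors with not using them and shows how the CNN structure differs between the two cases.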

This post is compiled from Tang Yudi's video lectures. Many thanks to him!

1. Approach

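The previous post, CNN Chinese Spam Email Classification (Part 1), introduced two preprocessing approaches; this post covers the second one. The word-segmented data serves as the training corpus: we pick the top n words as the vocabulary (or, alternatively, drop the words that occur rarely) and first train a word vector for each word. Then, using the vocabulary, we look up the index of every word in each email and use those indices to fetch the corresponding word vectors and stack them together.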
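The steps above can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the code from the tutorial: the toy emails, the helper names `build_vocab` and `docs_to_indices`, the reserved `<UNK>` index, and the random embedding table (standing in for trained word vectors) are all hypothetical.

```python
import random
from collections import Counter

# Toy, already-segmented emails (hypothetical data for illustration only).
emails = [
    ["免费", "领取", "奖品", "免费"],
    ["明天", "开会", "请", "准时"],
    ["免费", "奖品", "明天", "领取"],
]


def build_vocab(tokenized_docs, top_n):
    """Keep the top_n most frequent words; index 0 is reserved for unknown words."""
    counts = Counter(word for doc in tokenized_docs for word in doc)
    vocab = {"<UNK>": 0}
    for word, _ in counts.most_common(top_n):
        vocab[word] = len(vocab)
    return vocab


def docs_to_indices(tokenized_docs, vocab):
    """Map every word to its vocabulary index (0 if the word is not in the vocabulary)."""
    return [[vocab.get(word, 0) for word in doc] for doc in tokenized_docs]


vocab = build_vocab(emails, top_n=4)
indices = docs_to_indices(emails, vocab)

# Assumption: a random table stands in for the trained word vectors.
embedding_dim = 5
rng = random.Random(0)
embeddings = [[rng.gauss(0.0, 1.0) for _ in range(embedding_dim)]
              for _ in range(len(vocab))]

# Stack the vectors of the first email: one row per word.
email_matrix = [embeddings[i] for i in indices[0]]
print(indices[0], len(email_matrix), len(email_matrix[0]))
```

Each email becomes a matrix with one row per word, so emails of different lengths yield matrices with different numbers of rows.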

2. Data preprocessing

The first step is the same as before: strip out every non-Chinese character, then segment the text into words.

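import re

import jieba


def clean_str(string):
    string = string.strip('\n')
    # Replace every character outside the CJK range with a space
    string = re.sub(r"[^\u4e00-\u9fff]", " ", string)
    # Collapse runs of whitespace into a single space
    string = re.sub(r"\s{2,}", " ", string)
    return string.strip()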


def cut_line(line):
    # Clean the line, segment it with jieba, and join the words with spaces
    line = clean_str(line)
    seg_list = jieba.cut(line)
    cut_words = " ".join(seg_list)
    return cut_words


def load_data_and_labels(positive_data_file, negative_data_file):
    positive = []
    negative = []
    with open(positive_data_file, encoding='utf-8') as f:
        for line in f:
            positive.append(cut_line(line).split())
    with open(negative_data_file, encoding='utf-8') as f:
        for line in f:
            negative.append(cut_line(line).split())

    x_text = positive + negative

    positive_label = [[0, 1] for _ in positive]  # build one-hot labels: [[0, 1], [0, 1], [0, 1], ...]
    negative_label = [[1, 0] for _ in negative]