Automatic Poetry-Writing App Project, Built with Python + Android (Tech: LSTM + FastText Classification + word2vec + Flask + MySQL), Part 2

Article link: https://blog.csdn.net/Turbo_Come/article/details/94153192

I. Poem Classification

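First, I collected nearly 300,000 poems from the web, but they came without category labels. To classify them, I crawled 600 poems for each of five categories from gushiwen (https://www.gushiwen.org/): frontier and war (边塞征战), scenery and objects (写景咏物), landscape and pastoral (山水田园), homesickness and travel (思乡羁旅), and history and nostalgia (咏史怀古). These 3,000 labeled poems are the training set for a classifier, which is then used to label the 300,000 poems.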

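Dataset (the 5 × 600 labeled poems, the Fasttext.model classifier (you can also train your own), a stop-word list (the HIT stop-word list), the 300,000-poem corpus, and the classification dataset already formatted for FastText):

Link: https://pan.baidu.com/s/1ms2TFhVlbN44JN7Yaw7xZg
Extraction code: dbvt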

1. Code for crawling the five categories of poems:

"""
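    Crawl five categories of Tang poems from gushiwen (https://www.gushiwen.org/).
    Searching for a category name on the site may return fewer than 600 poems. In that case I also searched
    for the category's theme words and characteristic emotional words, until each category had 600 poems.
    1 Frontier and war (边塞征战)
    2 Scenery and objects (写景咏物)
    3 Landscape and pastoral (山水田园)
    4 Homesickness and travel (思乡羁旅)
    5 History and nostalgia (咏史怀古)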
"""
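# The original pulled these names in through `from config import *`;
# they are imported explicitly here so the script is self-contained.
import io
import re
import sys
from urllib import request

from bs4 import BeautifulSoup


def Get_url():
    # Change the default encoding of standard output
    sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')
    # Site to crawl: gushiwen, https://www.gushiwen.org/shiwen/
    # Search one category at a time and page through the results; each of the
    # five categories needs 600 poems for classifier training and testing
    for i in range(1,20):
        # i is the page number
        # The site uses three URL formats. In each one the value after `page=` or `A`
        # is the page number, so looping over it crawls page by page
        # url='https://www.gushiwen.org/shiwen/default.aspx?page=9&type=4&id=1'
        # 'https://www.gushiwen.org/shiwen/default_1A589282347eb3A2.aspx'
        # 'https://so.gushiwen.org/search.aspx?type=title&page=3&value=%E7%BE%81%E6%97%85%E6%80%9D%E4%B9%A1'
        url1 = 'https://so.gushiwen.org/search.aspx?type=title&page='
        page = '%d' % i
        # poem_type is the URL-encoded search term; it changes for each of the five runs
        poem_type = '&value=%e5%8f%a4%e8%bf%b9'  # URL-encoded 古迹 ("historic sites"); set before each run
        url = url1+page+poem_type # final URL
        head = {}
        # Enter about:version in the browser address bar (Chrome works best) to get the
        # browser version and user agent, which lets the request pose as a browser
        head['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
        req = request.Request(url, headers=head)
        response = request.urlopen(req)  # fetch the page
        html = response.read()  # raw HTML
        soup = BeautifulSoup(html, 'lxml') # parse with BeautifulSoup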

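        # poem = soup.find_all('textarea')  # content under the <textarea> tags
        file = 'Data/train_jar/yongshi.txt'
        # Read the content of each <textarea> tag
        for tag in soup.find_all(re.compile('textarea')):
            text = tag.text
            try:
                # Drop the trailing source link, then split out content, author and title
                text = text.rsplit('https', 1)[0]
                content, rest = text.rsplit('——', 1)
                author, title = rest.rsplit('《', 1)
                _, author = author.split('·')
                title, _ = title.split('》')
            except ValueError:
                # Skip entries that do not match the expected layout
                continue
            # print(content)  # inspect the poem content
            # Write as UTF-8 because the classifier code reads these files as UTF-8
            with open(file, 'a+', encoding='utf-8') as f:
                # One poem per line, formatted as title::author::content
                f.write(title + "::" + author + "::" + content + '\n')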

if __name__ == '__main__':
    Get_url()
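
The two string-handling steps in the crawler, building the search URL and splitting each entry into title, author and content, can be tried without touching the network. The sketch below is only an illustration: `build_search_url` and `parse_poem` are hypothetical helper names, and the sample entry assumes the `content——dynasty·author《title》https...` layout that the splitting code above implies. It is not a verified copy of the site's markup.

```python
from urllib.parse import quote

SEARCH_URL = 'https://so.gushiwen.org/search.aspx?type=title&page='


def build_search_url(page, keyword):
    """Build the URL of one result page; quote() percent-encodes the keyword as UTF-8."""
    return SEARCH_URL + str(page) + '&value=' + quote(keyword)


def parse_poem(text):
    """Split one entry into (title, author, content), or return None if the layout differs."""
    try:
        text = text.rsplit('https', 1)[0]      # drop the trailing source link
        content, rest = text.rsplit('——', 1)   # poem text vs. attribution
        author, title = rest.rsplit('《', 1)
        _, author = author.split('·')          # remove the dynasty prefix
        title, _ = title.split('》')
    except ValueError:
        return None
    return title, author, content


# Hand-written sample entry (the link is a placeholder, not a real page)
sample = '床前明月光，疑是地上霜。——唐代·李白《静夜思》https://so.gushiwen.org/example'
print(build_search_url(3, '古迹'))
print(parse_poem(sample))
```

Returning `None` for entries that do not fit the layout keeps one odd page element from stopping the whole crawl.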

2. FastText text classification model

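The data is cleaned first: stop words are removed using the HIT (Harbin Institute of Technology) stop-word list, special symbols such as “，。！（）《》” are filtered out of the poems, and poems that are too long or too short are dropped.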
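
A minimal sketch of these cleaning steps, assuming the poem has already been segmented into tokens (the article uses jieba for that). The function name `clean_poem`, the tiny stop-word set, and the `min_len`/`max_len` thresholds are all placeholders, since the article does not state the actual length limits.

```python
import re

CHINESE_RUN = re.compile(r'[\u4e00-\u9fa5]+')  # runs of Chinese characters only


def clean_poem(tokens, stop_words, min_len=2, max_len=200):
    """Return the cleaned tokens, or None when the poem is too short or too long.

    min_len and max_len are placeholder thresholds, counted in tokens.
    """
    kept = []
    for token in tokens:
        # Keep only the Chinese part of each token, which removes punctuation
        for piece in CHINESE_RUN.findall(token):
            if piece not in stop_words:
                kept.append(piece)
    if not (min_len <= len(kept) <= max_len):
        return None
    return kept


stop_words = {'的', '了'}  # stand-in for the HIT stop-word list
tokens = ['床前', '的', '明月', '光', '，', '疑是', '地上', '霜', '。']
print(clean_poem(tokens, stop_words))
```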

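How FastText works:

$x_1, \dots, x_N$: the features of one sentence (its words and n-grams); their vectors can be initialized from pre-trained word vectors (random initialization is the default when none are supplied)

Hidden (averaging) layer: the mean of the feature vectors, $h = \frac{1}{N}\sum_{i=1}^{N} x_i$

Output layer: the label of the sample

Objective function: the negative log-likelihood of the true labels over the $M$ training samples, $-\frac{1}{M}\sum_{m=1}^{M} y_m \log f(B h_m)$, where $h_m$ is the hidden vector of sample $m$, $B$ is the output weight matrix, $y_m$ is the one-hot label, and $f$ is the softmax function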
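
The layers described above can be traced by hand on toy numbers. This is a sketch of the idea in plain Python, not FastText's actual implementation: the vectors, the output matrix (called `B` here as an assumed name), and the choice of class 0 as the true label are all made up for illustration.

```python
import math


def average(vectors):
    """Hidden layer: element-wise mean of the input feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for numerical stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(vectors, weights):
    """Average the features, apply the output matrix, then softmax."""
    hidden = average(vectors)
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in weights]
    return softmax(scores)


# Toy numbers: three 2-dimensional feature vectors and two classes
x = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
B = [[1.0, 0.0], [0.0, 1.0]]
probs = forward(x, B)
loss = -math.log(probs[0])  # negative log-likelihood when class 0 is the true label
print(probs, loss)
```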

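The FastText text classifier combines some of the most successful ideas from natural language processing and machine learning: it represents sentences as bags of words and bags of n-grams, uses subword information, and shares information across classes through a hidden representation. It also uses a hierarchical softmax, which takes advantage of an imbalanced class distribution to speed up computation.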

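The FastText model has three parts: the model architecture, hierarchical softmax, and n-gram features. The architecture closely resembles the CBOW model in Word2Vec; the difference is that FastText predicts a label, whereas CBOW predicts the middle word. For hierarchical softmax, FastText exploits class imbalance by using the Huffman algorithm to build a tree that represents the classes.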
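
To see why a Huffman tree helps with imbalanced classes, the sketch below builds Huffman codes for a set of labels: frequent labels end up closer to the root, so they need fewer binary decisions. The label names and counts are hypothetical, and this is a generic Huffman construction written for illustration, not FastText's internal code.

```python
import heapq
from itertools import count


def huffman_codes(label_counts):
    """Return {label: bit string}; more frequent labels get shorter codes."""
    tiebreak = count()  # keeps heap entries comparable when counts are equal
    heap = [(n, next(tiebreak), label) for label, n in label_counts.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent nodes into one internal node
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (n1 + n2, next(tiebreak), (left, right)))

    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):      # internal node: (left, right)
            walk(node[0], prefix + '0')
            walk(node[1], prefix + '1')
        else:                            # leaf: a label string
            codes[node] = prefix or '0'

    walk(heap[0][2], '')
    return codes


# Hypothetical, deliberately imbalanced label counts
label_counts = {'a': 50, 'b': 20, 'c': 15, 'd': 10, 'e': 5}
codes = huffman_codes(label_counts)
print(codes)
```

With balanced classes (such as 600 training poems per category) every leaf sits at a similar depth, so the benefit comes mainly when some labels are much more common than others.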

Line format for FastText classifier training data:

'__label__' + classtage (category) + '\t' + line (poem text) + '\n'
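
As a small illustration of this format, the sketch below writes one training line and reads it back. `to_fasttext_line` and `from_fasttext_line` are hypothetical helper names, and the sample tokens are made up.

```python
LABEL_PREFIX = '__label__'


def to_fasttext_line(label, tokens):
    """Format one sample: label prefix, label, tab, tokens joined by spaces, newline."""
    return LABEL_PREFIX + label + '\t' + ' '.join(tokens) + '\n'


def from_fasttext_line(line):
    """Inverse of to_fasttext_line: return (label, tokens)."""
    tagged, text = line.rstrip('\n').split('\t', 1)
    return tagged[len(LABEL_PREFIX):], text.split(' ')


line = to_fasttext_line('边塞征战', ['万里', '将军'])
print(repr(line))
print(from_fasttext_line(line))
```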

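FastText classification mainly uses the train_supervised() function. The classifier here is configured with the following parameters:

input=path, path to the training file

epoch=25, number of training epochs

lr=1.0, initial learning rate

dim=100, vector dimension

wordNgrams=2, maximum n-gram length

minCount=1, minimum word frequency

loss="softmax", loss function type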
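
Collected in one place, the settings listed above could be passed to the `fasttext` Python package roughly as follows. This is a sketch, not the author's exact script: the `train` helper and the `TRAIN_PARAMS` name are assumptions, and the import sits inside the function only so that the parameter dictionary can be inspected on a machine without the library installed.

```python
TRAIN_PARAMS = {
    'epoch': 25,        # number of training epochs
    'lr': 1.0,          # initial learning rate
    'dim': 100,         # vector dimension
    'wordNgrams': 2,    # use word unigrams and bigrams
    'minCount': 1,      # keep every word, however rare
    'loss': 'softmax',  # plain softmax over the five classes
}


def train(path, save_path):
    """Train a supervised model on the file at `path` and save it to `save_path`."""
    import fasttext  # requires the `fasttext` package
    model = fasttext.train_supervised(input=path, **TRAIN_PARAMS)
    model.save_model(save_path)
    return model
```

A call such as `train('Data/train_jar/shi.txt', 'Data/train_jar/class_shi.model')` would then use the paths that appear in the article's code.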

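After the model classifies the corpus, the resulting categories are:

Frontier and war (28,932 poems), scenery and objects (38,078), landscape and pastoral (78,294), homesickness and travel (70,618), history and nostalgia (81,870)

A word cloud was drawn for each of the five categories:

In a word cloud, the larger a word is drawn, the more often it appears in that category.

The five most frequent content words in each category are:

Frontier and war: “万里” (ten thousand li), “将军” (general), “天子” (emperor), “四海” (the four seas), “天下” (all under heaven)

Scenery and objects: “春风” (spring breeze), “梅花” (plum blossom), “江南” (Jiangnan), “东风” (east wind), “故人” (old friend)

Landscape and pastoral: “青山” (green mountains), “白云” (white clouds), “人间” (the human world), “归来” (returning home), “山水” (mountains and waters)

Homesickness and travel: “春风” (spring breeze), “何处” (where), “故人” (old friend), “秋风” (autumn wind), “明月” (bright moon)

History and nostalgia: “平生” (a lifetime), “风雨” (wind and rain), “千里” (a thousand li), “人间” (the human world), “功名” (fame and rank)

The high-frequency words in each category do reflect that category's themes to a fair degree, which suggests the classification works reasonably well.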
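
The article does not show how the top words were counted, so the following is one possible sketch using `collections.Counter`. The `top_words` helper, the `min_chars=2` filter (a rough stand-in for keeping only content words), and the toy poems are all assumptions.

```python
from collections import Counter


def top_words(poems, k=5, min_chars=2):
    """Return the k most frequent tokens that have at least min_chars characters."""
    counts = Counter()
    for tokens in poems:
        counts.update(t for t in tokens if len(t) >= min_chars)
    return counts.most_common(k)


# Hand-made toy data, not taken from the classified corpus
poems = [
    ['万里', '将军', '出', '塞'],
    ['将军', '百战', '万里', '归'],
    ['将军', '夜', '引', '弓'],
]
print(top_words(poems, k=2))
```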

The full FastText classification code is as follows:

"""
    Poem classification model: FastText classification
    Data format: '__label__' + classtage (category) + tab + line (poem text) + newline
"""
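# The original pulled these names in through `from config import *`;
# they are imported explicitly here so the script is self-contained.
import os
import re

import fasttext as fastText  # the pip package is `fasttext`; the alias keeps the name used below
import jieba


# Read the stop words
def read_Stopwords():
    stop_word = set()  # a set removes duplicates and makes lookups fast
    stop_path = 'Data/Stopwords.txt'
    with open(stop_path, 'r', encoding='utf-8') as stop_file:
        for line in stop_file:
            # Keep each entry as a plain string so it can match a segmented word
            word = line.strip()
            if word:
                stop_word.add(word)
    return stop_word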

# Convert one category file into the FastText training format
def fileopems_file_deal(path, classtage):
    stop_word = read_Stopwords()
    rules = u'[\u4e00-\u9fa5]+'  # matches runs of Chinese characters only, which drops symbols such as ，。！（）
    pattern = re.compile(rules)

    # Open the output once, in append mode, so all five categories accumulate in shi.txt
    with open(path, 'r', encoding='utf-8') as f_reader, \
            open('Data/train_jar/shi.txt', 'a', encoding='utf-8') as f_write:
        for line in f_reader:
            line = ' '.join(jieba.cut(line.strip()))  # segment into space-separated words
            seg_list = pattern.findall(line)
            word_list = [word for word in seg_list if word not in stop_word]  # drop stop words
            if len(word_list) > 0:
                line = ' '.join(word_list)
                # FastText classification format: label, tab, text
                f_write.write('__label__' + classtage + '\t' + line + '\n')

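# Train on the data and save the model as a *.model file
def fasttext_deal():
    path = 'Data/train_jar/shi.txt'
    # Train the model with the parameters given in the article's parameter list
    model = fastText.train_supervised(
        input=path, epoch=25, lr=1.0, dim=100,
        wordNgrams=2, minCount=1, loss='softmax', verbose=2
    )
    # Save the model
    path_save = 'Data/train_jar/class_shi.model'
    model.save_model(path_save)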

# Convert one poem to the same format as the training text
_STOP_WORDS = None  # loaded on first use; rereading the file for each of 300,000 poems is slow

def file_deal(test):
    global _STOP_WORDS
    if _STOP_WORDS is None:
        _STOP_WORDS = read_Stopwords()
    # Text preprocessing
    pattern = re.compile(u'[\u4e00-\u9fa5]+')
    line = ' '.join(jieba.cut(test.strip()))
    seg_list = pattern.findall(line)
    word_list = [word for word in seg_list if word not in _STOP_WORDS]  # drop stop words
    # Space-separated words; an empty string when nothing is left after cleaning
    return ' '.join(word_list)

# Normalize every category's training data and append it all to shi.txt
def sum_filepoems_to_shi():
    categories = [
        ('Data/train_jar/biansai.txt', '边塞征战'),   # frontier and war
        ('Data/train_jar/jingwu.txt', '写景咏物'),    # scenery and objects
        ('Data/train_jar/shanshui.txt', '山水田园'),  # landscape and pastoral
        ('Data/train_jar/sixiang.txt', '思乡羁旅'),   # homesickness and travel
        ('Data/train_jar/yongshi.txt', '咏史怀古'),   # history and nostalgia
    ]
    for path, classtage in categories:
        fileopems_file_deal(path, classtage)

if __name__ == '__main__':
    # Train the model if it does not exist yet, then load it.
    # Training reads shi.txt; run sum_filepoems_to_shi() first if that file is missing.
    save_path = 'Data/train_jar/class_shi.model'
    if not os.path.exists(save_path):
        fasttext_deal()
    model = fastText.load_model(save_path)
    test_path = 'Data/train_jar/Whole_30w_poems.txt'  # the 300,000-poem dataset
    # One output file per predicted label
    class_paths = {
        '__label__边塞征战': 'Data/Generate_poems_jar/biansai.txt',
        '__label__写景咏物': 'Data/Generate_poems_jar/jingwu.txt',
        '__label__山水田园': 'Data/Generate_poems_jar/shanshui.txt',
        '__label__思乡羁旅': 'Data/Generate_poems_jar/sixiang.txt',
        '__label__咏史怀古': 'Data/Generate_poems_jar/yongshi.txt',
    }
    writers = {label: open(path, 'a+', encoding='utf-8')
               for label, path in class_paths.items()}
    try:
        with open(test_path, 'r', encoding='utf-8') as f_reader:
            for line in f_reader:
                tests_str = file_deal(line)
                if not tests_str:
                    continue  # nothing left after cleaning
                # predict() returns (labels, probabilities); labels[0] is the top class
                labels, _ = model.predict(tests_str)
                if labels and labels[0] in writers:
                    writers[labels[0]].write(line)
    finally:
        # Close every output file, which also flushes buffered lines to disk
        for f_write in writers.values():
            f_write.close()
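
The routing step relies on the shape of what `model.predict()` returns for a single string: a pair of (labels, probabilities), with the most likely label first. The sketch below shows one way to pull out the category name from that pair. `label_name` is a hypothetical helper, and the prediction is a hand-written stand-in so the example runs without a trained model.

```python
def label_name(prediction, prefix='__label__'):
    """Return the top label without its prefix, or None when nothing was predicted."""
    labels, _probs = prediction
    if len(labels) == 0:
        return None
    top = labels[0]
    return top[len(prefix):] if top.startswith(prefix) else top


# Hand-written stand-in for the (labels, probabilities) pair from model.predict()
fake_prediction = (('__label__边塞征战',), [0.93])
print(label_name(fake_prediction))
```

Comparing the label string directly is less fragile than comparing the text form of the whole tuple.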