[NLP] 6: gensim word2vec in practice on Chinese corpora: Chinese Wikipedia, the Tsinghua University NLP Lab dataset, and the Sogou full-web news dataset

Latest recommended article published 2024-07-10 00:00:24

Yang SiCheng

Category: [Natural Language Processing]. Tags: python, natural language processing, nlp, artificial intelligence

Article link: https://blog.csdn.net/qq_41897800/article/details/113802995

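This article describes in detail how to obtain and process large-scale Chinese corpora: Chinese Wikipedia, the Tsinghua University NLP Lab dataset, and the Sogou full-web news dataset. It covers downloading, decompression, cleaning, word segmentation, and stop-word removal, then trains gensim's Word2Vec model to produce a word-vector model. The training parameters and the model's output are examined, laying the groundwork for later semantic analysis and similarity computation.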

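1. Data download

The English corpora are the British National Corpus (BNC) (538 MB; sample data 22 MB) and the American National Corpus (318 MB). The Chinese corpora are the dataset from the Tsinghua University NLP Lab page "an efficient Chinese text classification toolkit" (1.45 GB) and Chinese Wikipedia (1.96 GB). The Sogou full-web news dataset had already been downloaded and used earlier.

Pitfall: downloading the British and American National Corpora with Xunlei was very slow and stalled repeatedly. After switching to Internet Download Manager and leaving the computer on overnight, the download finished.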

2. Chinese Wikipedia

2.1 Getting the data

I followed this article and this one.

Error:

OSError: Invalid data stream

According to this article, the input should be the original '.bz2' file, before decompression.
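As a small illustration of this error, here is a sketch that uses only Python's standard `bz2` module (it is not the author's code, and the sample payload is made up). A bz2 stream starts with the magic bytes `BZh`, and handing uncompressed bytes to the decompressor raises an `OSError`.

```python
import bz2

payload = "<page>维基百科</page>".encode("utf-8")  # made-up sample content
compressed = bz2.compress(payload)

# A valid bz2 stream starts with the magic bytes b"BZh".
is_bz2 = compressed[:3] == b"BZh"
print(is_bz2)

# Round trip: decompressing the compressed bytes restores the payload.
restored = bz2.decompress(compressed)

# Data that is not bz2-compressed makes the decompressor raise OSError.
try:
    bz2.decompress(payload)
    raised = False
except OSError as err:
    raised = True
    print("OSError:", err)
```

Checking the first three bytes of a downloaded dump is a cheap way to confirm that it is still in compressed form.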

Error:

TypeError: sequence item 0: expected a bytes-like object, str found

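For the fix, see this and this. The error means that the separator and the items passed to join() have different types (bytes versus str); they must match.

Note: the join() method concatenates the elements of a sequence into a new string, using the given string as the separator.

# -*- coding: UTF-8 -*-

sep = "-"
seq = ("a", "b", "c")  # a sequence of strings
print(sep.join(seq))

Output: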

a-b-c

The code that produces the txt data is as follows:

from gensim.corpora import WikiCorpus

def parse_corpus():
    space = ''
    i = 0
    output = open('...your path/zhwiki-latest-pages-articles.xml/zhwiki_{}.txt'.format(i),'w',encoding='utf-8')
    # WikiCorpus is gensim's class for reading a Wikipedia dump; it takes the .bz2 file
    wiki = WikiCorpus('...your path/zhwiki-latest-pages-articles.xml.bz2',lemmatize=False,dictionary={})
    for text in wiki.get_texts():
        output.write(space.join(text)+'\n')      # one article per line
        i += 1
        if(i % 10000 == 0):
            print('Saved' + str(i) + 'articles')
            output.close()      # close the finished file before opening the next one
            output = open(
                '...your path/zhwiki-latest-pages-articles.xml/zhwiki_{}.txt'.format(i // 10000), 'w', encoding='utf-8')
    output.close()
    print('Finish Saved' + str(i) + 'articles')

……
Saved370000articles
Saved380000articles
Finish Saved384755articles

So there are 384,755 articles in total, saved 10,000 per file into 39 txt files (zhwiki_0 through zhwiki_38), 1.28 GB altogether.
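The file count follows from the loop: file 0 is opened first and a new file is started after every 10,000 articles, so the number of files is the article count divided by the batch size, rounded up. A quick check of that arithmetic (the variable names are just for this sketch):

```python
import math

total_articles = 384755   # taken from the log output above
batch_size = 10000        # articles per txt file in parse_corpus()

num_files = math.ceil(total_articles / batch_size)
last_index = total_articles // batch_size
print(num_files, last_index)
```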

2.2 Data processing

Removing English characters

One pitfall: at first I used

regex = r'[a‐z]+'

which never matched what I expected. It turned out the pattern should be

regex = r'[a-z]+'

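The two hyphens are different characters: the first is a look-alike non-ASCII hyphen, the second is the ASCII hyphen-minus. Type the pattern yourself instead of copying it.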
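To see why the first pattern fails, here is a small check. It assumes the look-alike character is U+2010 (HYPHEN); inside a character class such a character is just a literal member, not a range operator. The sample string is made up.

```python
import re

text = "abc 维基 xyz"                      # made-up sample
look_alike = re.compile("[a\u2010z]+")     # class of three literal characters: a, U+2010, z
ascii_range = re.compile("[a-z]+")         # the range a through z

bad_matches = look_alike.findall(text)
good_matches = ascii_range.findall(text)
print(bad_matches)
print(good_matches)
```

The look-alike pattern only finds the isolated letters `a` and `z`, while the ASCII pattern finds the full words `abc` and `xyz`.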

The code for deleting English text:

# regex = r'[a-z]+ '       # a run of the 26 letters followed by a space
# txt = re.sub(regex,'',txt)
# regex = r'[a-z]+'        # a run of the 26 letters
# txt = re.sub(regex, '', txt)

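However, the text also contains Japanese and other characters, so I followed another article here. Also, the text8 file contains no punctuation and the Chinese Wikipedia text contains no digits, so it is enough to keep only Chinese characters and the relevant whitespace.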
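A minimal sketch of this filtering step on a made-up sample string. It keeps characters in the range U+4E00 to U+9FA5 plus newlines. Because that range covers CJK ideographs in general, Japanese kanji survive, while kana, Latin letters, digits, and punctuation are dropped.

```python
import re

sample = "维基百科 Wiki 2001 東京タワー\n百科全书!"   # made-up sample
kept = "".join(re.findall("[\u4e00-\u9fa5\n]", sample))
print(kept)
```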

Converting Traditional Chinese to Simplified Chinese

Install the opencc library:

pip install opencc

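Installation succeeded. The available conversions, following this article:

t2s - Traditional Chinese to Simplified Chinese
s2t - Simplified Chinese to Traditional Chinese
mix2t - Mixed to Traditional Chinese
mix2s - Mixed to Simplified Chinese

Note: I ran into a problem during the Traditional-to-Simplified conversion. One txt file made almost no progress, and after an hour it was still converting. I therefore converted that file separately, passing one line (one article) at a time to the converter and appending each result to a list.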

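Note that a list cannot be turned into a string directly with str(): the escape sequence '\n' ends up as literal text. For example:

chars = ['我','爱','你','\n']       # renamed from list to avoid shadowing the built-in
txt = '\'' + str(chars).replace('[','').replace(']','').replace('\'','').replace(', ','') + '\''
print(txt)
print('我爱你\n')
txt = ''.join(chars)
print(txt)

Output: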

'我爱你\n'
我爱你

我爱你

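So, following the referenced article, use .join() to turn a list into a string.

Chinese word segmentation

This was covered in the earlier post "[NLP] 3: the word2vec library and an example based on the Sogou full-web news dataset", so it is not repeated here.

Stop-word list

Stop words are words that carry little meaning, so they are removed by default when extracting keywords, key phrases, or key sentences. The idea of stop-word removal is to strip unneeded words and characters from the raw text.

I use the stop-word list from the Machine Intelligence Laboratory of Sichuan University. Other lists are collected in "Python 1.2 中文文本分析常用停用词表" (common stop-word lists for Chinese text analysis); for English, see this article.

How stop-word removal works: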

stopwords_list = ['了']      # example stop list, just for this demo
line = '牙齿 突然 又 小 了 许多 第二'
# keep only the tokens that are not stop words (whole-token comparison)
line = ' '.join(item for item in line.split(' ') if item not in stopwords_list)
print(line)

Output:

牙齿 突然 又 小 许多 第二
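A caveat worth knowing when removing stop words: `str.replace` works on substrings, so replacing a stop word plus a trailing space can also cut into a longer token that merely ends with that word. The sketch below contrasts the two approaches. The two-word stop list and the sample line are made up for the demo and are not the Sichuan University list.

```python
stop_words = {"了", "的"}          # hypothetical mini stop list for the demo
line = "牙齿 突然 又 小 了 许多 目的 第二"

# Substring replace: also damages the token "目的" because it ends with "的".
replaced = line
for word in stop_words:
    replaced = replaced.replace(word + " ", "")

# Token filter: compares whole tokens only, with constant-time set lookups.
filtered = " ".join(tok for tok in line.split(" ") if tok not in stop_words)

print(replaced)
print(filtered)
```

The set lookup also avoids scanning the whole stop list once per token, which matters on gigabyte-scale corpora.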

Code:

import re
import opencc
import jieba

stopwords_table_path = '...your path/四川大学机器智能实验室停用词库.txt'
# build the stop-word list, one word per line
with open(stopwords_table_path, 'r', encoding='utf-8') as file:
    stopwords_list = [item.replace('\n','') for item in file]


def simplified_Chinese(txt):
    cc = opencc.OpenCC('t2s')
    # txt = cc.convert(txt)
    i = 0
    txt_sim = []
    for sentence in txt.split('\n'):
        txt_sim.append(cc.convert(sentence) + '\n')
        print("第{}句话转化成简体成功".format(i))
        i += 1
    txt = ''.join(txt_sim)
    return txt


def stopwords(i):
    # stop-word removal
    in_path = '...your path/zhwiki-latest-pages-articles.xml/parse_txt/'
    out_path = '...your path/zhwiki-latest-pages-articles.xml/stopword_txt/'
    stop_set = set(stopwords_list)      # set lookup instead of scanning the list for every token

    with open(in_path + 'zhwikiSegDone_{}.txt'.format(i), 'r', encoding='utf-8') as file, \
         open(out_path + 'zhwikiStopWord_{}.txt'.format(i), 'w', encoding='utf-8') as txt:
        for line in file:
            # compare whole tokens; a substring replace could also cut into longer words
            kept = [item for item in line.rstrip('\n').split(' ') if item not in stop_set]
            txt.write(' '.join(kept) + '\n')
    return



def seg_done(i, txt):
    # word segmentation
    out_path = '...your path/zhwiki-latest-pages-articles.xml/parse_txt/'
    file = open(out_path + 'zhwikiSegDone_{}.txt'.format(i),'w',encoding='utf-8')
    file.write(' '.join(jieba.cut(txt, cut_all=False)).replace(' \n ', '\n'))
    file.close()

    # segmentation and stop-word removal in one pass
    # out_path = '...your path/zhwiki-latest-pages-articles.xml/segdone_stopword_txt/'
    # file = open(out_path + 'zhwiki_prepro_{}.txt'.format(i),'w',encoding='utf-8')
    # txt = ' '.join(jieba.cut(txt, cut_all=False)).replace(' \n ', '\n')
    # print('第' + str(i) + '个txt文件汉字分词成功')
    # for sentence in txt.split('\n'):
    #     for word in sentence.split(' '):
    #         for stopword in stopwords_list:
    #             if word == stopword:
    #                 sentence = sentence.replace(stopword+' ','')
    #     file.write(sentence+'\n')
    # file.close()
    # print('第' + str(i) + '个txt文件汉字停词成功')


def parse_txt():
    in_path = '...your path/zhwiki-latest-pages-articles.xml/zhwiki/'
    for i in range(21,22):      # for the full run this should be range(39), i.e. files 0 to 38
        file = open(in_path+'zhwiki_{}.txt'.format(i),'r',encoding='utf-8')
        txt = file.read()
        file.close()
        txt = ''.join(re.findall('[\u4e00-\u9fa5\n]',txt))      # keep only Chinese characters and newlines
        print('第' + str(i) + '个txt文件提取汉字成功')
        txt = simplified_Chinese(txt)
        print('第' + str(i) + '个txt文件繁体汉字转化简体汉字成功')
        seg_done(i, txt)


def main():
    # parse_txt()
    stopwords(21)       # file 21 is very slow when run directly; I don't know why
    return


if __name__ == '__main__':
    main()

Result:

This produces 39 files, 'zhwiki_prepro_0' through 'zhwiki_prepro_38'; the Chinese Wikipedia data is now processed.

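3. Tsinghua University NLP Lab dataset

Download and extract the archive. This gives a folder named 'THUCNews' containing 14 categories of news (listed in NewsCatalog in the code).

Code: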

import re
import jieba

stopwords_table_path = '...your path/四川大学机器智能实验室停用词库.txt'
# build the stop-word list, one word per line
with open(stopwords_table_path, 'r', encoding='utf-8') as file:
    stopwords_list = [item.replace('\n','') for item in file]

NewsCatalog = ['体育','娱乐','家居','彩票','房产','教育','时尚','时政','星座','游戏','社会','科技','股票','财经']
# NewsCatalog = ['社会','科技','股票','财经']     # i = 430801
# NewsCatalog = ['时政']

file_path = '...your path/THUCNews/THUCNews/'

stop_set = set(stopwords_list)      # set lookup is much faster than scanning the list

i = 0       # not reset per category: the output shows the file numbering continuing across categories
for category in NewsCatalog:
    combine = open(file_path + '{}.txt'.format(category), 'w', encoding='utf-8')
    sentence = []
    while(True):
        if(i % 200 == 0):
            print('\r处理完成文本数量：{}'.format(i), end='')
        try:
            file = open(file_path + category + '/' + '{}.txt'.format(i), 'r', encoding='utf-8')
        except FileNotFoundError:       # no file with this number: the category is finished
            combine.write(''.join(sentence))
            combine.close()
            print(category + '文本处理完毕')
            break
        i += 1
        txt = file.read().replace('\n　　',' ')      # one article per line
        file.close()
        txt = ''.join(re.findall('[\u4e00-\u9fa5 ]', txt))      # keep only Chinese characters and spaces
        words = ' '.join(jieba.cut(txt, cut_all=False)).split(' ')
        # drop stop words and the empty tokens left by repeated spaces
        txt = ' '.join(word for word in words if word and word not in stop_set)
        sentence.append(txt+'\n')

Output:

……
处理完成文本数量：481600社会文本处理完毕
处理完成文本数量：644400科技文本处理完毕
处理完成文本数量：798800股票文本处理完毕
处理完成文本数量：836000财经文本处理完毕

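4. Sogou full-web news dataset

Reprocess the data: keep only Chinese characters and newlines, segment the text, remove stop words, and write every 30,000 news items to one txt file. The code is as follows: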

import re
import jieba

stopwords_table_path = '...your path/四川大学机器智能实验室停用词库.txt'
# build the stop-word list, one word per line
with open(stopwords_table_path, 'r', encoding='utf-8') as file:
    stopwords_list = [item.replace('\n','') for item in file]


path = '...your path/news_tensite_xml.full/'

file = open(path + 'news_tensite_xml.txt', 'r', encoding='gb18030')

news = []
i = 0

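stop_set = set(stopwords_list)      # set lookup is much faster than scanning the list

for txt in file:        # iterate lazily; readlines() would load the whole file at once
    txt = ''.join(re.findall('[\u4e00-\u9fa5\n]', txt))     # keep only Chinese characters and the newline
    words = ' '.join(jieba.cut(txt, cut_all=False)).split(' ')
    txt = ' '.join(word for word in words if word not in stop_set)      # whole-token stop-word filter
    news.append(txt)
    if i > 0 and i % 30000 == 0:
        out_file = open(path + 'Sogounews_{}.txt'.format(int(i/30000-1)), 'w', encoding='utf-8')
        out_file.write(''.join(news))
        out_file.close()
        news = []
    i += 1
# note: lines read after the last multiple of 30000 stay in news and are not written
file.close()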

Result:

Just over 1.14 million news items in total, in 38 txt files.
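The batching idea (a fixed number of lines per output file) can be sketched independently of the dataset. This is an assumed variant and not the author's code: the helper name `write_in_batches` and the file-name pattern are hypothetical, and unlike the loop in this section it also writes the final partial batch.

```python
import os
import tempfile


def write_in_batches(lines, out_dir, batch_size, pattern="batch_{}.txt"):
    """Write `lines` to numbered files holding at most `batch_size` lines each."""
    paths, batch = [], []

    def flush():
        # The next file index equals the number of files written so far.
        path = os.path.join(out_dir, pattern.format(len(paths)))
        with open(path, "w", encoding="utf-8") as out:
            out.writelines(batch)
        paths.append(path)
        batch.clear()

    for line in lines:
        batch.append(line)
        if len(batch) == batch_size:
            flush()
    if batch:               # trailing partial batch
        flush()
    return paths


out_dir = tempfile.mkdtemp()
demo_lines = ["新闻 {}\n".format(n) for n in range(7)]   # made-up demo data
written = write_in_batches(demo_lines, out_dir, batch_size=3)
with open(written[-1], encoding="utf-8") as handle:
    last_content = handle.read()
print([os.path.basename(p) for p in written])
```

Counting lines in the current batch, instead of testing a global counter against a multiple, keeps every file at the same size and makes the leftover case explicit.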

5. Merging the files

Model training needs only a single txt file, so all the corpus txt files produced so far must be merged into one. The idea is as follows:

File txt1:

This is a Test
You

File txt2:

This also a test
As you can see

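Mind the spaces and newlines: no stray whitespace should be left between the texts of different files (see the linked note on how .strip works).

Code:

path = '...your path/test/'

# read each input, trimming newlines and spaces at both ends
with open(path + '{}.txt'.format(1), 'r', encoding='utf-8') as file1:
    txt1 = file1.read().strip('\n').strip(' ')
with open(path + '{}.txt'.format(2), 'r', encoding='utf-8') as file2:
    txt2 = file2.read().strip('\n').strip(' ')
# append both texts to the merged file, one newline after each
with open(path + '{}.txt'.format('total'), 'a', encoding='utf-8') as file:
    file.write(txt1 + '\n')
    file.write(txt2 + '\n')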

Result:

This is a Test
You
This also a test
As you can see

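Now merge the processed files:

Chinese Wikipedia: 39 files, 1.23 GB;
THU dataset: 14 categories, 1.86 GB;
Sogou news dataset: 38 files, 1.60 GB

Merging is essentially copy and paste plus some handling of newlines and spaces. Since the Notepad that ships with Windows 10 cannot open very large files (generally anything over 1 GB), the merge is done in code: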

NewsCatalog = ['体育','娱乐','家居','彩票','房产','教育','时尚','时政','星座','游戏','社会','科技','股票','财经']

path = '...your path/test/'
wiki_path = '...your path/zhwiki-latest-pages-articles.xml/segdone_stopword_txt/'
THUCNews_path = '...your path/THUCNews/THUCNews/'
SougouNews_path = '...your path/news_tensite_xml.full/'

Data = open(path + 'Data.txt', 'a', encoding='utf-8')

for i in range(39):      # merge the Chinese Wikipedia files
    file = open(wiki_path + 'zhwiki_prepro_{}.txt'.format(i), 'r', encoding='utf-8')
    txt = file.read().strip('\n').strip(' ')
    Data.write(txt + '\n')
    file.close()

print('中文wiki百科文件合并完成')

for item in NewsCatalog:        # merge the THU dataset
    file = open(THUCNews_path + '{}.txt'.format(item), 'r', encoding='utf-8')
    txt = file.read().strip('\n').strip(' ')
    Data.write(txt + '\n')
    file.close()

print('THU数据集合并完成')

for i in range(38):      # merge the Sogou news dataset
    file = open(SougouNews_path + 'Sogounews_{}.txt'.format(i), 'r', encoding='utf-8')
    txt = file.read().strip('\n').strip(' ')
    Data.write(txt + '\n')
    file.close()

print('搜狗新闻数据集合并完成')
Data.close()        # flush and close the merged file

The three datasets add up to 4.69 GB and the resulting 'Data.txt' is 4.70 GB, which is consistent.
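Reading each file whole with `read()` works but needs enough memory for the largest file. A line-by-line merge is a possible alternative, sketched here with small temporary files. The helper name `merge_files` is hypothetical, and unlike the code in this section it strips whitespace from every line and skips blank lines.

```python
import os
import tempfile


def merge_files(in_paths, out_path):
    """Write the non-empty lines of each input file to out_path, one per line."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in in_paths:
            with open(path, "r", encoding="utf-8") as src:
                for line in src:          # streams one line at a time
                    line = line.strip()
                    if line:
                        out.write(line + "\n")
                        count += 1
    return count


tmp = tempfile.mkdtemp()
p1, p2 = os.path.join(tmp, "1.txt"), os.path.join(tmp, "2.txt")
with open(p1, "w", encoding="utf-8") as f:
    f.write("This is a Test\nYou\n\n")
with open(p2, "w", encoding="utf-8") as f:
    f.write("\nThis also a test\nAs you can see ")
total = os.path.join(tmp, "total.txt")
merged_count = merge_files([p1, p2], total)
with open(total, encoding="utf-8") as f:
    merged_text = f.read()
print(merged_text, end="")
```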

6. Word2Vec

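Model training

At first I passed the whole txt document to the sentences parameter, which raised a MemoryError:

from gensim.models import Word2Vec

path = '...your path/test/Data.txt'
file = open(path, 'r', encoding='utf-8')
txt = file.read()       # loads the entire 4.70 GB file into memory

# gensim 3.x argument names (gensim 4 renamed size to vector_size and iter to epochs)
model = Word2Vec(sentences=txt, size=300, window=5, iter=10)

Clearly this is not the way to do it. Following several articles, the LineSentence class from the word2vec module should be used, which iterates over the file one line (one sentence) at a time. The complete training code: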

from gensim.models import Word2Vec
from gensim.models import word2vec

path = '...your path/test/Data.txt'

sentences = word2vec.LineSentence(path)

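# sg: selects the word2vec architecture. 0 (the default) is CBOW, 1 is Skip-Gram
# hs: selects the training objective. 1 uses Hierarchical Softmax; 0 (the default) uses Negative Sampling, provided negative is greater than 0
# negative: number of negative samples when Negative Sampling is used, default 5; values in [3,10] are suggested
# min_count: minimum word frequency for a word to get a vector. It filters out rare words; default 5. Lower it for a small corpus
# iter: number of passes (epochs) over the corpus during training, default 5. It can be increased for a large corpus
# alpha: initial step size (learning rate) of stochastic gradient descent, written as η in the algorithm write-up; default 0.025
# min_alpha: the step size decreases gradually during training; min_alpha is the smallest value it reaches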
model = Word2Vec(sentences, size=300, window=5, iter=10)
model.save('word2vec.model')
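Conceptually, a line-based iterator avoids the MemoryError by yielding one tokenized line at a time instead of holding the whole file in memory. The class below is a simplified stand-in written for illustration: the name `StreamingSentences` is hypothetical and this is not gensim's implementation. It can be iterated repeatedly, which training over several epochs requires.

```python
import os
import tempfile


class StreamingSentences:
    """Iterate over a text file, yielding each non-empty line as a list of tokens."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is reopened on every pass, so the object can be reused.
        with open(self.path, "r", encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()
                if tokens:
                    yield tokens


demo_path = os.path.join(tempfile.mkdtemp(), "Data.txt")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("清华大学 位于 北京\n\n爱因斯坦 提出 相对论\n")   # made-up demo corpus

sentences = StreamingSentences(demo_path)
first_pass = list(sentences)
second_pass = list(sentences)   # a fresh pass, as needed for several epochs
print(first_pass)
```

A plain generator function would be exhausted after the first pass, which is why the sketch uses a class with `__iter__`.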

Testing the model:

model = Word2Vec.load('word2vec.model')


print(model.vector_size)
print(model.accuracy)
print(model.total_train_time)
print(model.wv)
print(model.most_similar('清华大学'))
print(model.most_similar('狗'))
print(model.most_similar('爱因斯坦'))
print(model.most_similar('加拿大'))

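Output (model.accuracy is printed without being called, so the second line only shows the bound method):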

300
<bound method Word2Vec.accuracy of <gensim.models.word2vec.Word2Vec object at 0x0000022002D12B50>>
11660.956405
<gensim.models.keyedvectors.Word2VecKeyedVectors object at 0x0000022002D120D0>
[('北京大学', 0.8560054302215576), ('清华', 0.75086909532547), ('复旦大学', 0.7291961312294006), ('武汉大学', 0.6843674182891846), ('南开大学', 0.6799485683441162), ('北大', 0.6776059865951538), ('上海交通大学', 0.6770678758621216), ('浙江大学', 0.6726593971252441), ('上海交大', 0.6696270108222961), ('人民大学', 0.6536081433296204)]
[('小狗', 0.778811514377594), ('猫', 0.6958572864532471), ('狗狗', 0.6867367029190063), ('宠物狗', 0.6718654632568359), ('流浪狗', 0.6594550013542175), ('大狗', 0.6557995676994324), ('小猫', 0.6476685404777527), ('金毛犬', 0.6255425214767456), ('狼狗', 0.6238532662391663), ('爱犬', 0.6213604807853699)]
[('海森堡', 0.6210607290267944), ('玻尔', 0.6101648807525635), ('霍金', 0.5979549288749695), ('相对论', 0.5952655076980591), ('朗道', 0.5940419435501099), ('波耳', 0.5926888585090637), ('薛定谔', 0.5919954180717468), ('泡利', 0.5903158187866211), ('劳厄', 0.5894380807876587), ('爱丁顿', 0.5840190052986145)]
[('澳大利亚', 0.799114465713501), ('澳洲', 0.7645018100738525), ('新西兰', 0.7618841528892517), ('英国', 0.7025945782661438), ('纽西兰', 0.6988958120346069), ('多伦多', 0.692410409450531), ('墨西哥', 0.6753289699554443), ('渥太华', 0.6639829277992249), ('温哥华', 0.6550958156585693), ('新加坡', 0.6529322266578674)]
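The scores in these lists are cosine similarities between word vectors. A toy sketch of that ranking with hand-made 3-dimensional vectors follows. The vectors and the helper names are invented for illustration; real vectors would come from `model.wv`.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=2):
    """Rank all other words by cosine similarity to `word`, best first."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


toy_vectors = {                     # made-up vectors, not trained values
    "狗": [1.0, 0.0, 0.0],
    "小狗": [0.9, 0.1, 0.0],
    "猫": [0.6, 0.8, 0.0],
    "加拿大": [0.0, 0.0, 1.0],
}
ranking = most_similar("狗", toy_vectors)
print(ranking)
```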

This raises a warning (as the message says, calling model.wv.most_similar() instead avoids it):

E:/Users/Yang SiCheng/PycharmProjects/main.py:233: DeprecationWarning: Call to deprecated `most_similar` (Method will be removed in 4.0.0, use self.wv.most_similar() instead).
 print(model.most_similar('清华大学'))
 ……		# the same warning for 狗, 爱因斯坦 and 加拿大

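Model analysis

Training took 11660.956405 seconds in total, about 3.24 hours. Next time, enable logging before training so that progress can be monitored. The configuration is: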

import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Following the previous post, the full model can be replaced by KeyedVectors, which keeps only the mapping from keys to vectors. Code:

from gensim.models import Word2Vec
from gensim.models import KeyedVectors

model = Word2Vec.load('word2vec.model')
word_vectors = model.wv

word_vectors.save('vectors.kv')
# reloaded_word_vectors = KeyedVectors.load('vectors.kv')

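Summary

Judging by eye, the results of this training run are much better than those in the earlier post "[NLP] 3: the word2vec library and an example based on the Sogou full-web news dataset". Next steps:

How can the quality of word2vec results be evaluated? Is there a Chinese dataset, like the one in "[NLP] Paper translation 2: Word2Vec model analysis for semantic similarity of English words", that allows measuring accuracy and correlation coefficients?
Build sentence vectors from word vectors so that sentences can be compared for similarity, for example by simply concatenating word vectors, weighted sums, convolutional neural networks, doc2vec, or the methods built into gensim word2vec for sentence similarity and multi-word similarity. The principles behind each could be analyzed and compared.
Following "[NLP] Paper translation 2: Word2Vec model analysis for semantic similarity of English words", vary the window and size parameters, i.e. the window width and the vector dimension (and perhaps the training method), retrain, analyze the results again, and compare how the parameters affect the model.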
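As a possible starting point for the sentence-vector item, the simplest approach is to average the word vectors of a sentence (a weighted sum with equal weights) and compare sentences by cosine similarity. The sketch below uses made-up 2-dimensional vectors; all names and values are hypothetical, and in practice the lookup would use the trained `model.wv`.

```python
import math

toy_wv = {                         # made-up vectors, not trained values
    "我": [0.1, 0.9], "爱": [0.8, 0.2], "喜欢": [0.7, 0.3],
    "狗": [0.9, 0.1], "猫": [0.8, 0.3],
}


def sentence_vector(tokens, wv):
    """Average the vectors of the tokens found in wv; zeros if none are known."""
    known = [wv[t] for t in tokens if t in wv]
    dim = len(next(iter(wv.values())))
    if not known:
        return [0.0] * dim
    return [sum(vec[d] for vec in known) / len(known) for d in range(dim)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


s1 = sentence_vector(["我", "爱", "狗"], toy_wv)
s2 = sentence_vector(["我", "喜欢", "猫"], toy_wv)
similarity = cosine(s1, s2)
print(s1, s2, similarity)
```

Averaging ignores word order and treats every word as equally important, which is one reason to compare it against the weighted and model-based alternatives listed in the second item.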