Learning Natural Language Processing, Part 3

Latest recommended article published on 2021-04-15 11:00:09

weiwen6933



Article link: https://blog.csdn.net/weiwen6933/article/details/104226749


Day 3

Chinese Wikipedia 2017 data, extraction code: ttzr
Click here to download the latest Wikipedia corpus

Building word vectors with the Gensim library

A simple example

from gensim.models import word2vec
import logging
# Define the format of the log messages that will be printed
logging.basicConfig(format='%(asctime)s:%(message)s', level=logging.INFO)

raw_sentences = ['the quick brown fox jumps over the lazy dogs','yoyoyo you go home now to sleep']

sentences = [s.split() for s in raw_sentences]
print(sentences)

[['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dogs'], ['yoyoyo', 'you', 'go', 'home', 'now', 'to', 'sleep']]

model = word2vec.Word2Vec(sentences,min_count=1)

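Parameter notes
min_count: the right word-frequency threshold depends on the size of the corpus. In a large corpus, words that occur only once or twice can be ignored, and min_count sets this cutoff: words with fewer occurrences are dropped. Typical values lie between 0 and 100; Gensim's default is 5.
size: the dimensionality of the word vectors (the width of the hidden layer), not the number of layers. The default is 100. Larger vectors need more training data but can improve overall accuracy; reasonable values range from tens to a few hundred. In Gensim 4.0 and later this parameter is named vector_size.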
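
To make the effect of min_count concrete, here is a minimal pure-Python sketch of frequency-based vocabulary pruning. It only imitates the idea and is not Gensim's implementation; the helper name prune_vocab is made up for illustration.

```python
from collections import Counter


def prune_vocab(sentences, min_count):
    """Return the set of words that occur at least min_count times.

    Hypothetical helper: it mimics the idea behind min_count,
    not Gensim's internal vocabulary code.
    """
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}


raw_sentences = ['the quick brown fox jumps over the lazy dogs',
                 'yoyoyo you go home now to sleep']
sentences = [s.split() for s in raw_sentences]

vocab_all = prune_vocab(sentences, min_count=1)  # every distinct word is kept
vocab_2 = prune_vocab(sentences, min_count=2)    # only 'the' occurs twice

print(len(vocab_all))
print(vocab_2)
```

With a corpus this small, almost every word occurs exactly once, which is why the simple example passes min_count=1; Gensim's default of 5 would discard every word here.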

Using the Wikipedia data

1. Extract the XML dump into a txt file

# Modified code (saved as process.py)
import logging
import os.path
import sys
from gensim.corpora import WikiCorpus
if __name__ == '__main__':
    
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))
    # check and process input arguments
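    if len(sys.argv) < 3:
        # The script has no module docstring, so print an explicit usage message
        print("Usage: python3 process.py <wiki-dump.xml.bz2> <output.txt>")
        sys.exit(1)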
    inp, outp = sys.argv[1:3]
    space = ' '  # str separator between tokens; a bytes separator fails in Python 3
    i = 0
    output = open(outp, 'w',encoding='utf-8')
    # The lemmatize argument is for Gensim 3.x; Gensim 4.0+ no longer supports it, so omit it there
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        s=space.join(text)
        s=s.encode('utf8').decode('utf8') + "\n"
        output.write(s)
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " articles")
    output.close()
    logger.info("Finished Saved " + str(i) + " articles")

Run the following commands in a terminal

cd /Users/mac/Desktop/csdn/自然语言处理/Gensim-代码
python3 process.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text

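The code above has been modified, because two problems came up one after the other.
1) TypeError: sequence item 0: expected a bytes-like object, str found
Fix: the original code had space = b' ', but the items in text are str, not bytes, so s = space.join(text) fails with a bytes separator. Change b' ' to ' '.
2) 'str' object has no attribute 'decode'
Fix: the original code was s = s.decode('utf8') + "\n". In Python 3 a str is already Unicode text and has no decode method; only bytes objects can be decoded. Encoding first and then decoding, s = s.encode('utf8').decode('utf8') + "\n", removes the error, although the round trip leaves the string unchanged, so s = s + "\n" would work just as well.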
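
Both errors can be reproduced without Gensim. The sketch below is only an illustration: the list text is an invented stand-in for one article's tokens, which the first error message shows to be str objects.

```python
# Invented stand-in for the token list of one article (str items)
text = ['维基', '百科']

# Problem 1: a bytes separator cannot join str items
try:
    b' '.join(text)
except TypeError as err:
    first_error = str(err)
print(first_error)

# A str separator works
joined = ' '.join(text)

# Problem 2: in Python 3, str has no decode method; only bytes does
has_decode = hasattr(joined, 'decode')
print(has_decode)

# Encoding and then decoding gives back the same string, so it is a no-op
round_trip = joined.encode('utf8').decode('utf8')
print(round_trip == joined)
```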

2. Word segmentation with jieba
Install the jieba package and segment the extracted txt text.

import jieba

# Note: step 1 wrote wiki.zh.text, while this script reads wiki.zh.jian.text;
# the intermediate step that produced that file is not shown in this post.
with open('wiki.zh.jian.text', 'r', encoding='utf8') as f, \
        open('zh.jian.wiki.seg-1.3g.txt', 'w', encoding='utf8') as target:
    print('open files')
    # Each line of the input file holds one article
    for line_num, line in enumerate(f, start=1):
        print('---- processing ', line_num, ' article----------------')
        # jieba.cut yields the tokens; join them with spaces, one article per line
        line_seg = " ".join(jieba.cut(line.strip())) + "\n"
        target.write(line_seg)

3. Build the model with word2vec

import logging
import os.path
import sys
import multiprocessing
from gensim.corpora import WikiCorpus
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
if __name__ == '__main__':
    
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))
    # check and process input arguments
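    if len(sys.argv) < 4:
        # The script has no module docstring, so print an explicit usage message
        print("Usage: python3 word2vec_model.py <segmented.txt> <model-out> <vectors-out>")
        sys.exit(1)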
    inp, outp1, outp2 = sys.argv[1:4]
    # Gensim 3.x API; in Gensim 4.0+ pass vector_size=400 instead of size=400
    model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5, workers=multiprocessing.cpu_count())
    model.save(outp1)
    # Export the vectors in the plain-text word2vec format
    model.wv.save_word2vec_format(outp2, binary=False)
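
LineSentence expects one sentence per line with tokens separated by whitespace, which is why the segmented file from step 2 can be passed to it directly. The following pure-Python sketch shows that reading pattern; iter_sentences is a hypothetical stand-in, not Gensim's class, and the two sample lines are invented.

```python
import os
import tempfile


def iter_sentences(path):
    """Yield one list of tokens per non-empty line.

    Hypothetical stand-in that imitates the reading pattern of
    LineSentence: one line = one sentence, tokens split on whitespace.
    """
    with open(path, encoding='utf8') as f:
        for line in f:
            tokens = line.split()
            if tokens:
                yield tokens


# Write a tiny, invented segmented file to a temporary location
with tempfile.NamedTemporaryFile('w', encoding='utf8', suffix='.txt',
                                 delete=False) as tmp:
    tmp.write('数学 是 一门 学科\n篮球 是 一项 运动\n')
    path = tmp.name

sentences = list(iter_sentences(path))
os.remove(path)
print(sentences)
```

Reading lazily, one line at a time, keeps memory use low, which matters for a corpus as large as the Wikipedia dump.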

Run in a terminal

python3 word2vec_model.py zh.jian.wiki.seg-1.3g.txt wiki.zh.text.model wiki.zh.text.vector

4. Test the trained model

# Find the words closest to each of the following words, computed from the vectors
from gensim.models import Word2Vec
en_wiki_word2vec_model = Word2Vec.load('wiki.zh.text.model')

testwords = ['数学','篮球','榕树','同性恋']
for word in testwords:
    # Query through .wv; calling most_similar on the model itself is deprecated
    res = en_wiki_word2vec_model.wv.most_similar(word)
    print(word)
    print(res)
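
most_similar ranks words by the cosine similarity of their vectors. As a rough illustration of that computation, here is a minimal sketch in plain Python. The three-dimensional vectors, the extra word '物理', and both helper functions are invented for the example; the real model uses 400-dimensional vectors learned from the corpus.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Invented toy vectors, not values from a trained model
toy_vectors = {
    '数学': [0.9, 0.1, 0.0],
    '物理': [0.8, 0.2, 0.1],
    '篮球': [0.0, 0.9, 0.3],
}


def toy_most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word` (hypothetical helper)."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


result = toy_most_similar('数学', toy_vectors)
print(result)
```

In this toy data '物理' ranks above '篮球' for the query '数学' because its vector points in nearly the same direction.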