Crawling a Corpus and Training word2vec Word Vectors

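This article walks through a complete workflow: fetching news data with a web crawler, segmenting the text into words, and training word vectors with the gensim library.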

1. Crawling data with a web crawler

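Training word vectors that perform well usually requires more than a gigabyte of text. This article is only a demo of the workflow, so it crawls a single news article as the training corpus. The article's address is the url variable in the script below.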

#!/usr/bin/python3
# -*- coding:utf-8 -*-

"""
@Author  : heyw
@Contact : he_yuanwen@126.com
@Time    : 2020/2/22 14:40
@Software: PyCharm
@FileName: spider_news.py
"""
import requests
from lxml import etree


url = "https://www.ithome.com/0/474/267.htm"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.116 Safari/537.36"
}
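html = requests.get(url=url, headers=headers)
# Decode the response as UTF-8 so the Chinese text is not garbled
html.encoding = 'utf-8'
selector = etree.HTML(html.text)
# Collect every text node inside the article body
content = selector.xpath('//div[@class="post_content"]//text()')
# Strip three kinds of Unicode spaces: \u00a0 (non-breaking), \u0020 (ordinary space), \u3000 (ideographic)
content = [x.replace("\u3000", "").replace("\xa0", "").replace("\u0020", "") for x in content]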

# Write one text fragment per line; the data/ directory must already exist
with open('data/news.txt','w',encoding='utf-8') as f:
    for line in content:
        f.write(line + '\n')
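
The filter in the script above removes ordinary spaces too, which merges any English words in the text, and fragments that become empty after filtering are still written as blank lines. As a minimal sketch of a gentler alternative, the helper below trims surrounding whitespace, removes only the two special spaces, and drops fragments that end up empty. The name clean_fragments and the sample strings are assumptions for illustration, not part of the original script.

```python
def clean_fragments(fragments):
    """Normalize text fragments returned by XPath and drop empty ones.

    Hypothetical helper: removes U+00A0 (non-breaking) and U+3000
    (ideographic) spaces, trims surrounding whitespace, and skips
    fragments that are empty afterwards. Ordinary spaces inside a
    fragment are kept.
    """
    cleaned = []
    for fragment in fragments:
        text = fragment.replace("\u3000", "").replace("\xa0", "").strip()
        if text:
            cleaned.append(text)
    return cleaned


# Made-up sample input, shaped like the output of a //text() query
sample = ["\u3000\u3000First paragraph", "\r\n", "\xa0", "word2vec is a model "]
print(clean_fragments(sample))  # ['First paragraph', 'word2vec is a model']
```

In the crawler this would stand in for the line that rebuilds content, for example content = clean_fragments(content).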

2. Segmenting the training corpus

#!/usr/bin/python3
# -*- coding:utf-8 -*-

"""
@Author  : heyw
@Contact : he_yuanwen@126.com
@Time    : 2020/2/22 17:54
@Software: PyCharm
@FileName: cut_words.py
"""
import pkuseg

# Load pkuseg with its default segmentation model
seg = pkuseg.pkuseg()

with open('data/news.txt','r',encoding='utf-8') as fi:
    with open('data/split.txt','w',encoding='utf-8') as fo:
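        for line in fi:
            line = line.strip()
            # Skip blank lines so they do not become empty sentences
            if not line:
                continue
            words = seg.cut(line)
            # Write the tokens of one line separated by single spaces
            fo.write(" ".join(words) + '\n')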
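
The segmented file holds one line of the article per row, with tokens separated by single spaces. For a corpus too large to fit in memory, a restartable iterable can stream the file rather than build one big list. The sketch below assumes that file layout; the class name SegmentedCorpus and the demo file contents are hypothetical. gensim ships a comparable helper, gensim.models.word2vec.LineSentence.

```python
import os
import tempfile


class SegmentedCorpus:
    """Restartable iterable over a pre-segmented text file.

    Hypothetical helper: yields one list of tokens per non-empty line.
    """

    def __init__(self, path, encoding="utf-8"):
        self.path = path
        self.encoding = encoding

    def __iter__(self):
        # Reopen the file on every pass so the corpus can be read repeatedly,
        # which multi-epoch training needs.
        with open(self.path, "r", encoding=self.encoding) as f:
            for line in f:
                tokens = line.split()
                if tokens:
                    yield tokens


# Demo on a throwaway file that mimics the layout of data/split.txt
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "split.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("词 向量 训练\n\n语料 分词\n")
    corpus = SegmentedCorpus(demo_path)
    first_pass = list(corpus)
    second_pass = list(corpus)

print(first_pass)  # [['词', '向量', '训练'], ['语料', '分词']]
```

A plain generator would be exhausted after one pass, which is why the sketch uses a class with its own __iter__.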

3. Training word vectors with gensim

#!/usr/bin/python3
# -*- coding:utf-8 -*-

"""
@Author  : heyw
@Contact : he_yuanwen@126.com
@Time    : 2020/2/22 18:08
@Software: PyCharm
@FileName: gen_w2v.py
"""
import logging
import multiprocessing
from gensim.models import Word2Vec


if __name__ == '__main__':

    logger = logging.getLogger(__name__)
    logging.basicConfig(format='%(asctime)s - %(levelname)s - %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % __name__)

    corpus_path = 'data/split.txt'
    w2v_path = 'data/embedding_w2v_d300.txt'

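    # Each sentence is a list of tokens: one line of the file split on whitespace
    with open(corpus_path, 'r', encoding='utf-8') as f:
        content = [line.split() for line in f]

    model = Word2Vec(
        content,  # iterable of tokenized sentences (lists of words)
        size=300,  # dimensionality of each word vector (named vector_size in gensim >= 4.0)
        window=5,  # context window: up to 5 words before and 5 words after the target word
        min_count=1,  # minimum frequency; rarer words are dropped (default 5). 1 keeps every word of this tiny corpus
        iter=10,  # number of training epochs, default 5 (named epochs in gensim >= 4.0)
        sg=0,  # training algorithm: 1 for skip-gram, otherwise CBOW
        workers=multiprocessing.cpu_count()  # number of worker threads (default 3)
    )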

    model.wv.save_word2vec_format(w2v_path, binary=False)
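
The training script uses the gensim 3.x parameter names size and iter. gensim 4.0 renamed them to vector_size and epochs, so the script fails with a TypeError on newer releases. One possible way to keep a single script working on both is to choose the keyword names from the installed version, as in this sketch. The helper name word2vec_kwargs is hypothetical, and the version strings are examples only.

```python
def word2vec_kwargs(gensim_version, dim=300, epochs=10):
    """Return the dimension/epoch keyword arguments under the names
    that the given gensim version expects.

    Hypothetical helper: gensim_version is a string such as "3.8.3".
    """
    major = int(gensim_version.split(".")[0])
    if major >= 4:
        return {"vector_size": dim, "epochs": epochs}
    return {"size": dim, "iter": epochs}


print(word2vec_kwargs("3.8.3"))  # {'size': 300, 'iter': 10}
print(word2vec_kwargs("4.3.2"))  # {'vector_size': 300, 'epochs': 10}
```

In the training script this could be used as Word2Vec(content, window=5, min_count=1, sg=0, **word2vec_kwargs(gensim.__version__)), after importing gensim.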

The saved file is in word2vec text format. Its first line gives the vocabulary size and the vector dimension (688 300 here), and each following line holds one word followed by its 300 components. The beginning of the first vector's values:

6.262602e-05 -0.0027163788 0.0008756147 0.0061735353 -0.0073240288 0.00716846 0.002975204 -0.00017872732 ...
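
To check the exported vectors without gensim, the text format can be parsed directly: a header line with the vocabulary size and dimension, then one "word v1 v2 ..." line per word. The following is a minimal sketch under that assumption. The function names and the toy three-word file are made up for illustration, and the toy values are not results from the trained model. With gensim installed, KeyedVectors.load_word2vec_format(path, binary=False) reads the same file.

```python
import math
import os
import tempfile


def load_word2vec_text(path, encoding="utf-8"):
    """Parse a word2vec text file into a {word: [floats]} dict."""
    vectors = {}
    with open(path, "r", encoding=encoding) as f:
        # Header line: "<vocab_size> <dimension>"
        vocab_size, dim = map(int, f.readline().split())
        for line in f:
            parts = line.rstrip("\n").split(" ")
            if len(parts) != dim + 1:
                continue  # skip malformed lines
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, vocab_size, dim


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy file in the same layout as data/embedding_w2v_d300.txt
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "toy_w2v.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("3 2\na 1.0 0.0\nb 0.0 1.0\nc 2.0 0.0\n")
    vectors, vocab_size, dim = load_word2vec_text(demo_path)

print(vocab_size, dim)                                # 3 2
print(cosine_similarity(vectors["a"], vectors["c"]))  # 1.0
print(cosine_similarity(vectors["a"], vectors["b"]))  # 0.0
```

With a corpus of a single article, similarities between real words will mostly reflect the random initialization, so a check like this is useful for validating the file format rather than the quality of the vectors.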