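Training a word2vec model on Wikimedia data with multiprocessing

Downloading the corpus:

See: https://blog.csdn.net/weixin_35757704/article/details/115614112

1. Code for training the Word2vec model

For code that trains a word2vec model using only the gensim library, see: https://blog.csdn.net/weixin_35757704/article/details/115601271

2. Removing stop words

The main code is: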

def get_stop_words(filepath) -> list:
    # The stop-word file is one comma-separated line; strip the trailing newline
    with open(filepath, 'r', encoding='utf-8') as f:
        return f.readline().strip().split(',')


def filter_stop_words(content: list) -> list:
    """Remove stop words."""
    stop_words = set(get_stop_words('en_stopwords_line.txt'))  # set: O(1) membership tests
    return [word for word in content if word not in stop_words]

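The contents of the stop-word list en_stopwords_line.txt are given below; copy them into a text file. Note that the whole list is a single line of data: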

'd,'ll,'m,'re,'s,'t,'ve,ZT,ZZ,a,a's,able,about,above,abst,accordance,according,accordingly,across,act,actually,added,adj,adopted,affected,affecting,affects,after,afterwards,again,against,ah,ain't,all,allow,allows,almost,alone,along,already,also,although,always,am,among,amongst,an,and,announce,another,any,anybody,anyhow,anymore,anyone,anything,anyway,anyways,anywhere,apart,apparently,appear,appreciate,appropriate,approximately,are,area,areas,aren,aren't,arent,arise,around,as,aside,ask,asked,asking,asks,associated,at,auth,available,away,awfully,b,back,backed,backing,backs,be,became,because,become,becomes,becoming,been,before,beforehand,began,begin,beginning,beginnings,begins,behind,being,beings,believe,below,beside,besides,best,better,between,beyond,big,biol,both,brief,briefly,but,by,c,c'mon,c's,ca,came,can,can't,cannot,cant,case,cases,cause,causes,certain,certainly,changes,clear,clearly,co,com,come,comes,concerning,consequently,consider,considering,contain,containing,contains,corresponding,could,couldn't,couldnt,course,currently,d,date,definitely,describe,described,despite,did,didn't,differ,different,differently,discuss,do,does,doesn't,doing,don't,done,down,downed,downing,downs,downwards,due,during,e,each,early,ed,edu,effect,eg,eight,eighty,either,else,elsewhere,end,ended,ending,ends,enough,entirely,especially,et,et-al,etc,even,evenly,ever,every,everybody,everyone,everything,everywhere,ex,exactly,example,except,f,face,faces,fact,facts,far,felt,few,ff,fifth,find,finds,first,five,fix,followed,following,follows,for,former,formerly,forth,found,four,from,full,fully,further,furthered,furthering,furthermore,furthers,g,gave,general,generally,get,gets,getting,give,given,gives,giving,go,goes,going,gone,good,goods,got,gotten,great,greater,greatest,greetings,group,grouped,grouping,groups,h,had,hadn't,happens,hardly,has,hasn't,have,haven't,having,he,he's,hed,hello,help,hence,her,here,here's,hereafter,hereby,herein,heres,hereupon,hers,herself,hes,hi,hid,high,higher,highest,him,himself,his,hither,home,hopefully,how,howbeit,however,hundred,i,i'd,i'll,i'm,i've,id,ie,if,ignored,im,immediate,immediately,importance,important,in,inasmuch,inc,include,indeed,index,indicate,indicated,indicates,information,inner,insofar,instead,interest,interested,interesting,interests,into,invention,inward,is,isn't,it,it'd,it'll,it's,itd,its,itself,j,just,k,keep,keeps,kept,keys,kg,kind,km,knew,know,known,knows,l,large,largely,last,lately,later,latest,latter,latterly,least,less,lest,let,let's,lets,like,liked,likely,line,little,long,longer,longest,look,looking,looks,ltd,m,made,mainly,make,makes,making,man,many,may,maybe,me,mean,means,meantime,meanwhile,member,members,men,merely,mg,might,million,miss,ml,more,moreover,most,mostly,mr,mrs,much,mug,must,my,myself,n,n't,na,name,namely,nay,nd,near,nearly,necessarily,necessary,need,needed,needing,needs,neither,never,nevertheless,new,newer,newest,next,nine,ninety,no,nobody,non,none,nonetheless,noone,nor,normally,nos,not,noted,nothing,novel,now,nowhere,number,numbers,o,obtain,obtained,obviously,of,off,often,oh,ok,okay,old,older,oldest,omitted,on,once,one,ones,only,onto,open,opened,opening,opens,or,ord,order,ordered,ordering,orders,other,others,otherwise,ought,our,ours,ourselves,out,outside,over,overall,owing,own,p,page,pages,part,parted,particular,particularly,parting,parts,past,per,perhaps,place,placed,places,please,plus,point,pointed,pointing,points,poorly,possible,possibly,potentially,pp,predominantly,present,presented,presenting,presents,presumably,previously,primarily,probably,problem,problems,promptly,proud,provides,put,puts,q,que,quickly,quite,qv,r,ran,rather,rd,re,readily,really,reasonably,recent,recently,ref,refs,regarding,regardless,regards,related,relatively,research,respectively,resulted,resulting,results,right,room,rooms,run,s,said,same,saw,say,saying,says,sec,second,secondly,seconds,section,see,seeing,seem,seemed,seeming,seems,seen,sees,self,selves,sensible,sent,serious,seriously,seven,several,shall,she,she'll,shed,shes,should,shouldn't,show,showed,showing,shown,showns,shows,side,sides,significant,significantly,similar,similarly,since,six,slightly,small,smaller,smallest,so,some,somebody,somehow,someone,somethan,something,sometime,sometimes,somewhat,somewhere,soon,sorry,specifically,specified,specify,specifying,state,states,still,stop,strongly,sub,substantially,successfully,such,sufficiently,suggest,sup,sure,t,t's,take,taken,taking,tell,tends,th,than,thank,thanks,thanx,that,that'll,that's,that've,thats,the,their,theirs,them,themselves,then,thence,there,there'll,there's,there've,thereafter,thereby,thered,therefore,therein,thereof,therere,theres,thereto,thereupon,these,they,they'd,they'll,they're,they've,theyd,theyre,thing,things,think,thinks,third,this,thorough,thoroughly,those,thou,though,thoughh,thought,thoughts,thousand,three,throug,through,throughout,thru,thus,til,tip,to,today,together,too,took,toward,towards,tried,tries,truly,try,trying,ts,turn,turned,turning,turns,twice,two,u,un,under,unfortunately,unless,unlike,unlikely,until,unto,up,upon,ups,us,use,used,useful,usefully,usefulness,uses,using,usually,uucp,v,value,various,very,via,viz,vol,vols,vs,w,want,wanted,wanting,wants,was,wasn't,way,ways,we,we'd,we'll,we're,we've,wed,welcome,well,wells,went,were,weren't,what,what'll,what's,whatever,whats,when,whence,whenever,where,where's,whereafter,whereas,whereby,wherein,wheres,whereupon,wherever,whether,which,while,whim,whither,who,who'll,who's,whod,whoever,whole,whom,whomever,whos,whose,why,widely,will,willing,wish,with,within,without,won't,wonder,words,work,worked,working,works,world,would,wouldn't,www,x,y,year,years,yes,yet,you,you'd,you'll,you're,you've,youd,young,younger,youngest,your,youre,yours,yourself,yourselves,z,zero,zt,zz
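
Since the stop-word file is read for every article that gets filtered, it may be worth loading the list once and passing it around. The following is only a sketch of that idea: `load_stop_words`, `remove_stop_words`, and the tiny `demo_stopwords.txt` file are hypothetical names made up for illustration, not part of the original code, and the real list would be en_stopwords_line.txt.

```python
import os
import tempfile


def load_stop_words(filepath):
    """Read a one-line, comma-separated stop-word file into a frozenset."""
    with open(filepath, 'r', encoding='utf-8') as f:
        # strip() drops the trailing newline so the last word matches correctly
        return frozenset(f.read().strip().split(','))


def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in the stop-word set."""
    return [token for token in tokens if token not in stop_words]


# Tiny stand-in file for demonstration (assumption: same one-line format
# as en_stopwords_line.txt)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo_stopwords.txt')
    with open(demo_path, 'w', encoding='utf-8') as f:
        f.write("a,the,of,it's\n")
    demo_stop_words = load_stop_words(demo_path)

demo_tokens = ['the', 'history', 'of', 'a', 'language']
print(remove_stop_words(demo_tokens, demo_stop_words))  # ['history', 'language']
```

A set makes each membership test constant-time, which matters when the check runs once per token across millions of articles.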

3. Lemmatization

Lemmatization with NLTK:

from nltk.stem import WordNetLemmatizer


def word_lemmatize(all_content):
    """Lemmatize each word in place, as a verb, then a noun, then an adjective.
    Requires the WordNet data (nltk.download('wordnet')).
    """
    lemmatize = WordNetLemmatizer()
    for i, word in enumerate(all_content):
        word = lemmatize.lemmatize(word, pos='v')
        word = lemmatize.lemmatize(word, pos='n')
        all_content[i] = lemmatize.lemmatize(word, pos='a')
    return all_content

4. Multiprocessing

For multiprocessing, see: https://blog.csdn.net/weixin_35757704/article/details/115674954

Full code
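
The pattern this post relies on (submit tasks with `apply_async`, collect each result through a callback, then close and join the pool before using the results) can be sketched with the standard library alone. This is a minimal illustration, not the linked article's code: `normalize`, `process_all`, and the `pool_factory` parameter are hypothetical names, and the toy task only lowercases tokens.

```python
import multiprocessing


def normalize(tokens):
    """Toy worker task: lowercase every token (stands in for real preprocessing)."""
    return [token.lower() for token in tokens]


def process_all(token_lists, pool_factory=multiprocessing.Pool, processes=2):
    """Submit one task per token list and gather the results through a callback."""
    results = []
    pool = pool_factory(processes)
    for tokens in token_lists:
        # The callback runs in the parent process once the worker returns
        pool.apply_async(normalize, args=(tokens,), callback=results.append)
    pool.close()  # no more tasks will be submitted
    pool.join()   # block until every task and its callback has finished
    return results
```

Calling `process_all([['Hello', 'World'], ['Word2Vec']])` from inside an `if __name__ == '__main__':` block returns the lowercased lists, though not necessarily in submission order, since tasks finish independently. Skipping `close()` and `join()` would let the caller move on while workers are still running.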

import gensim
from nltk.stem import WordNetLemmatizer
from gensim.models import Word2Vec, word2vec
import multiprocessing


def get_stop_words(filepath) -> list:
    # The stop-word file is one comma-separated line; strip the trailing newline
    with open(filepath, 'r', encoding='utf-8') as f:
        return f.readline().strip().split(',')


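def write_in_file(str_line):
    # Used as the pool callback, which runs in the main process, so writes do not interleave
    with open('train_wiki.txt', 'a', encoding='utf-8') as f:
        line = " ".join(str_line) + "\n"
        f.write(line)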


def filter_stop_words(content) -> list:
    """Remove stop words."""
    stop_words = set(get_stop_words('en_stopwords_line.txt'))  # set: O(1) membership tests
    return [word for word in content if word not in stop_words]


def multiplication(content):
    # Preprocess one article in a worker process: lowercase, drop stop words, lemmatize
    content = lower_word(content)
    content = filter_stop_words(content)
    content = word_lemmatize(content)
    return content


def lower_word(all_content):
    for i, content in enumerate(all_content):
        all_content[i] = content.lower()  # convert the text to lowercase
    return all_content


def word_lemmatize(all_content):
    """Lemmatize each word in place, as a verb, then a noun, then an adjective.
    Requires the WordNet data (nltk.download('wordnet')).
    """
    lemmatize = WordNetLemmatizer()
    for i, word in enumerate(all_content):
        word = lemmatize.lemmatize(word, pos='v')
        word = lemmatize.lemmatize(word, pos='n')
        all_content[i] = lemmatize.lemmatize(word, pos='a')
    return all_content


def train_word2vec(words_file):
    # sentences can be built with BrownCorpus, Text8Corpus, or LineSentence
    # LineSentence streams the tokenized file from disk instead of loading it all into memory
    sentences = word2vec.LineSentence(words_file)
    model = Word2Vec(sentences, vector_size=350, window=5, sg=0, workers=multiprocessing.cpu_count())
    return model


if __name__ == '__main__':
    # Extract the text
    wiki = gensim.corpora.WikiCorpus('enwiki-latest-pages-articles.xml.bz2', dictionary={})  # English corpus downloaded from the official Wikipedia site
    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    for text in wiki.get_texts():  # process each article
        pool.apply_async(func=multiplication, args=(text,), callback=write_in_file)
    pool.close()
    pool.join()  # wait for all preprocessing tasks to finish before training
    print('Begin Word2vec training')
    # Start training the word2vec model
    model = train_word2vec(words_file='train_wiki.txt')
    model.save('wiki_word2vec.model')
    print('WIKI WORD2VEC MODEL FINISH')