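Training word2vec on Chinese Wikipedia, with code

Reference article: 中英文维基百科语料上的Word2Vec实验 (Word2Vec experiments on Chinese and English Wikipedia corpora)

Data source: https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2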

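Convert Traditional Chinese to Simplified Chinese (zht2zhs.ini is the config name in OpenCC 0.x; OpenCC 1.x ships the same conversion as t2s.json):

opencc -i zhwiki.txt -o zhwiki.txt.simple -c zht2zhs.ini

Then split the converted file into several smaller files. The ../ path suggests the command is run from inside the seg/ directory, which is where the segmentation script reads its input:
split -l 30000 ../zhwiki.txt.simple seg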
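
Where the split command is not available, the same step can be sketched in Python. This is a minimal sketch, not part of the original workflow: the function name split_lines and its parameters are hypothetical, and it only imitates split's default two-letter suffixes (segaa, segab, ...).

```python
import itertools
import os
import string


def split_lines(src_path, out_dir, lines_per_file=30000, prefix="seg"):
    """Write consecutive chunks of src_path to out_dir/segaa, segab, ..."""
    os.makedirs(out_dir, exist_ok=True)
    # Two-letter suffixes aa, ab, ..., zz, imitating split's default naming.
    # This sketch supports at most 26 * 26 = 676 pieces.
    suffixes = ("".join(p) for p in itertools.product(string.ascii_lowercase, repeat=2))
    names = []
    with open(src_path, "r", encoding="utf-8") as src:
        while True:
            # Read the next lines_per_file lines without loading the whole file.
            chunk = list(itertools.islice(src, lines_per_file))
            if not chunk:
                break
            name = prefix + next(suffixes)
            with open(os.path.join(out_dir, name), "w", encoding="utf-8") as dst:
                dst.writelines(chunk)
            names.append(name)
    return names
```

Calling split_lines("zhwiki.txt.simple", "seg") from the working directory would then leave the pieces in seg/, matching the layout the segmentation script expects.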
Multi-process word segmentation:
$ cat multi_cut.py
import os
from multiprocessing import Pool, cpu_count

import jieba


def cut(name):
    """Segment every line of seg/<name> with jieba and write it to out/<name>."""
    print(name)
    with open(os.path.join('seg', name), 'r', encoding='utf-8') as f, \
            open(os.path.join('out', name), 'w', encoding='utf-8') as out:
        for line in f:
            # Segment each whitespace-separated chunk on its own, so that
            # jieba never emits the spaces themselves as tokens.
            words = []
            for chunk in line.split():
                words.extend(jieba.lcut(chunk, cut_all=False))
            out.write(' '.join(words) + '\n')


if __name__ == '__main__':
    # The output directory must exist before the workers start writing.
    os.makedirs('out', exist_ok=True)
    # Pieces produced by split: segaa, segab, segac, ...
    files = sorted(os.listdir('seg'))
    print(files)
    # Leave one core free, but always keep at least one worker.
    with Pool(max(1, cpu_count() - 1)) as pool:
        pool.map(cut, files)
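
Once every piece has been segmented, the files in out/ usually need to be joined back into one corpus before training. A minimal sketch of that step follows; the name merge_segments and the output path are hypothetical, and the merged file must live outside out/ so that it is not read back as input.

```python
import os


def merge_segments(out_dir, merged_path):
    """Concatenate the segmented pieces, in name order, into one corpus file."""
    count = 0
    with open(merged_path, "w", encoding="utf-8") as merged:
        # Sorting restores the original order: segaa, segab, segac, ...
        for name in sorted(os.listdir(out_dir)):
            with open(os.path.join(out_dir, name), "r", encoding="utf-8") as piece:
                for line in piece:
                    merged.write(line)
                    count += 1
    # Number of lines written, handy as a sanity check against the input.
    return count
```

For example, merge_segments("out", "zhwiki.seg.txt") would produce a single file with one segmented line per input line.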


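The code below comes from the internet. Although it is listed last, it is the script that extracts plain text from the Wikipedia dump (its usage string calls it process_wiki.py), so it runs before the conversion and segmentation steps: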

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: Pan Yang (panyangnlp@gmail.com)
# Copyright 2017

from __future__ import print_function

import logging
import os.path
import six
import sys

from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) != 3:
        print("Using: python process_wiki.py enwiki.xxx.xml.bz2 wiki.en.text")
        sys.exit(1)
    inp, outp = sys.argv[1:3]
    space = " "
    i = 0

    output = open(outp, 'w')
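    # lemmatize=False is accepted by gensim < 4.0 only; gensim 4.0 removed
    # the argument, so drop it when running on a newer version.
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})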
    for text in wiki.get_texts():
        if six.PY3:
            output.write(' '.join(text) + '\n')
        #   ###another method###
        #    output.write(
        #            space.join(map(lambda x:x.decode("utf-8"), text)) + '\n')
        else:
            output.write(space.join(text) + "\n")
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " articles")

    output.close()
    logger.info("Finished Saved " + str(i) + " articles")
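
The post does not show the training step itself. A minimal sketch of it, assuming gensim 4.x and a corpus file with one space-separated, segmented line per sentence, could look like the following. The function name train_word2vec, the file names, and the hyperparameter values are illustrative assumptions, not the author's settings.

```python
import multiprocessing


def train_word2vec(corpus_path, model_path, vector_size=100, window=5, min_count=5):
    """Train a word2vec model on a space-separated, one-sentence-per-line corpus."""
    # Imported lazily so this module still loads where gensim is not installed.
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    model = Word2Vec(
        sentences=LineSentence(corpus_path),  # streams the file line by line
        vector_size=vector_size,              # this argument is named `size` in gensim 3.x
        window=window,
        min_count=min_count,                  # ignore words rarer than this
        workers=max(1, multiprocessing.cpu_count() - 1),
    )
    model.save(model_path)
    return model
```

After model = train_word2vec("zhwiki.seg.txt", "zhwiki.word2vec.model"), nearest neighbours of a word can be inspected with model.wv.most_similar(word), provided the word occurs at least min_count times in the corpus.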




