Training a Chinese word2vec Model (Code Implementation)


1. Data preparation and preprocessing

First, you need a reasonably large Chinese corpus. The Chinese Wikipedia is a good choice (the Sogou news corpus is another option).

A packaged dump of the Chinese Wikipedia is available at:
Link: https://pan.baidu.com/s/1H-wuIve0d_fvczvy3EOKMQ  extraction code: uqua

Accelerated Baidu Netdisk download: https://www.baiduwp.com/?m=index

The Chinese Wikipedia dump is not very large; the compressed XML file is roughly 1 GB. The first step is to extract plain text from this XML archive using gensim's WikiCorpus.

Note: adjust the input and output paths for your environment.

import logging
import os.path
import sys
from gensim.corpora import WikiCorpus

if __name__ == '__main__':

    # Define input and output paths
    basename = "F:/temp/DL/"
    inp = basename + 'zhwiki-latest-pages-articles.xml.bz2'
    outp = basename + 'wiki.zh.text'

    program = os.path.basename(sys.argv[0])  # script name, used to name the logger
    logger = logging.getLogger(program)
    # Configure the log format and level
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    space = " "
    i = 0
    output = open(outp, 'w', encoding='utf-8')
    # lemmatize=False: lemmatization is unnecessary for Chinese
    # (note: the lemmatize argument was removed in gensim 4.x)
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():  # each text is a list of tokens from one article
        output.write(space.join(text) + "\n")
        i = i + 1
        if i % 10000 == 0:
            logger.info("Saved " + str(i) + " articles")
    output.close()
    logger.info("Finished Saved " + str(i) + " articles")


2019-05-08 21:42:31,184: INFO: running c:\users\mantch\appdata\local\programs\python\python35\lib\site-packages\ipykernel_launcher.py -f C:\Users\mantch\AppData\Roaming\jupyter\runtime\kernel-30939db9-3a59-4a92-844c-704c6189dbef.json
2019-05-08 21:43:12,274: INFO: Saved 10000 articles
2019-05-08 21:43:45,223: INFO: Saved 20000 articles
2019-05-08 21:44:14,638: INFO: Saved 30000 articles
2019-05-08 21:44:44,601: INFO: Saved 40000 articles
2019-05-08 21:45:16,004: INFO: Saved 50000 articles
2019-05-08 21:45:47,421: INFO: Saved 60000 articles
2019-05-08 21:46:16,722: INFO: Saved 70000 articles
2019-05-08 21:46:46,733: INFO: Saved 80000 articles
2019-05-08 21:47:16,143: INFO: Saved 90000 articles
2019-05-08 21:47:47,533: INFO: Saved 100000 articles
2019-05-08 21:48:29,591: INFO: Saved 110000 articles
2019-05-08 21:49:04,530: INFO: Saved 120000 articles
2019-05-08 21:49:40,279: INFO: Saved 130000 articles
2019-05-08 21:50:15,592: INFO: Saved 140000 articles
2019-05-08 21:50:54,183: INFO: Saved 150000 articles
2019-05-08 21:51:31,123: INFO: Saved 160000 articles
2019-05-08 21:52:06,278: INFO: Saved 170000 articles
2019-05-08 21:52:43,157: INFO: Saved 180000 articles
2019-05-08 21:55:59,809: INFO: Saved 190000 articles
2019-05-08 21:57:01,859: INFO: Saved 200000 articles
2019-05-08 21:58:33,921: INFO: Saved 210000 articles
2019-05-08 21:59:26,744: INFO: Saved 220000 articles
2019-05-08 22:00:41,757: INFO: Saved 230000 articles
2019-05-08 22:01:36,532: INFO: Saved 240000 articles
2019-05-08 22:02:26,347: INFO: Saved 250000 articles
2019-05-08 22:03:08,634: INFO: Saved 260000 articles
2019-05-08 22:03:53,447: INFO: Saved 270000 articles
2019-05-08 22:04:37,136: INFO: Saved 280000 articles
2019-05-08 22:05:14,017: INFO: Saved 290000 articles
2019-05-08 22:06:01,296: INFO: Saved 300000 articles
2019-05-08 22:06:47,762: INFO: Saved 310000 articles
2019-05-08 22:07:39,714: INFO: Saved 320000 articles
2019-05-08 22:08:28,825: INFO: Saved 330000 articles
2019-05-08 22:09:11,412: INFO: finished iterating over Wikipedia corpus of 338005 documents with 77273203 positions (total 3288566 articles, 91445479 positions before pruning articles shorter than 50 words)
2019-05-08 22:09:11,555: INFO: Finished Saved 338005 articles

2. Training the word2vec model

In Python, jieba can handle the word segmentation, producing the segmented file wiki.zh.text.seg; a minimal sketch of that step is shown below. The word2vec model is then trained on the segmented corpus.
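A minimal segmentation sketch, assuming jieba and the file paths used above:

# Word segmentation with jieba (a sketch; paths assumed from the steps above):
# read the extracted corpus line by line and write space-separated tokens.
import jieba

with open('F:/temp/DL/wiki.zh.text', encoding='utf-8') as fin, \
     open('F:/temp/DL/wiki.zh.text.seg', 'w', encoding='utf-8') as fout:
    for line in fin:
        words = jieba.cut(line.strip())  # default (accurate) cut mode
        fout.write(' '.join(words) + '\n')

With the segmented corpus in place, train the model with gensim: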

Note: adjust the input and output paths for your environment.

import logging
import os.path
import sys
import multiprocessing
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Define input and output paths
basename = "F:/temp/DL/"
inp = basename + 'wiki.zh.text.seg'       # segmented corpus from the previous step
outp1 = basename + 'wiki.zh.text.model'   # full gensim model
outp2 = basename + 'wiki.zh.text.vector'  # plain-text word2vec vectors

program = os.path.basename(sys.argv[0])
logger = logging.getLogger(program)
logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
logging.root.setLevel(level=logging.INFO)
logger.info("running %s" % ' '.join(sys.argv))

# size=400: 400-dimensional vectors (renamed to vector_size in gensim 4.x);
# window=5: context window size; min_count=5: drop words seen fewer than 5 times;
# workers: number of CPU cores to use
model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5,
                 workers=multiprocessing.cpu_count())
# trim unneeded model memory = use (much) less RAM
# model.init_sims(replace=True)
model.save(outp1)
model.wv.save_word2vec_format(outp2, binary=False)

The output is as follows:
2019-05-08 22:28:25,638: INFO: running c:\users\mantch\appdata\local\programs\python\python35\lib\site-packages\ipykernel_launcher.py -f C:\Users\mantch\AppData\Roaming\jupyter\runtime\kernel-b1f915fd-fdb2-43fc-bcf3-b361fb4a7c3d.json
2019-05-08 22:28:25,640: INFO: collecting all words and their counts
2019-05-08 22:28:25,642: INFO: PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
Automatically created module for IPython interactive environment
2019-05-08 22:28:27,887: INFO: PROGRESS: at sentence #10000, processed 4278620 words, keeping 2586311 word types
2019-05-08 22:28:29,666: INFO: PROGRESS: at sentence #20000, processed 7491125 words, keeping 4291863 word types
2019-05-08 22:28:31,445: INFO: PROGRESS: at sentence #30000, processed 10424455 words, keeping 5704507 word types
2019-05-08 22:28:32,854: INFO: PROGRESS: at sentence #40000, processed 13190001 words, keeping 6983862 word types
2019-05-08 22:28:34,125: INFO: PROGRESS: at sentence #50000, processed 15813238 words, keeping 8145905 word types
2019-05-08 22:28:35,353: INFO: PROGRESS: at sentence #60000, processed 18388731 words, keeping 9198885 word types
2019-05-08 22:28:36,544: INFO: PROGRESS: at sentence #70000, processed 20773000 words, keeping 10203788 word types
2019-05-08 22:28:37,652: INFO: PROGRESS: at sentence #80000, processed 23064544 words, keeping 11144885 word types
2019-05-08 22:28:39,490: INFO: PROGRESS: at sentence #90000, processed 25324650 words, keeping 12034343 word types
2019-05-08 22:28:40,688: INFO: PROGRESS: at sentence #100000, processed 27672540 words, keeping 12878856 word types
2019-05-08 22:28:41,871: INFO: PROGRESS: at sentence #110000, processed 29985282 words, keeping 13688622 word types
2019-05-08 22:28:42,944: INFO: PROGRESS: at sentence #120000, processed 32025045 words, keeping 14477767 word types
2019-05-08 22:28:44,048: INFO: PROGRESS: at sentence #130000, processed 34267390 words, keeping 15309447 word types
2019-05-08 22:28:45,197: INFO: PROGRESS: at sentence #140000, processed 36451394 words, keeping 16090548 word types
2019-05-08 22:28:46,345: INFO: PROGRESS: at sentence #150000, processed 38671717 words, keeping 16877015 word types
2019-05-08 22:28:47,483: INFO: PROGRESS: at sentence #160000, processed 40778409 words, keeping 17648492 word types
2019-05-08 22:28:48,655: INFO: PROGRESS: at sentence #170000, processed 43154040 words, keeping 18308373 word types
2019-05-08 22:28:49,759: INFO: PROGRESS: at sentence #180000, processed 45231681 words, keeping 19010906 word types
2019-05-08 22:28:50,826: INFO: PROGRESS: at sentence #190000, processed 47190144 words, keeping 19659373 word types
2019-05-08 22:28:51,886: INFO: PROGRESS: at sentence #200000, processed 49201934 words, keeping 20311518 word types
2019-05-08 22:28:52,856: INFO: PROGRESS: at sentence #210000, processed 51116197 words, keeping 20917125 word types
2019-05-08 22:28:53,859: INFO: PROGRESS: at sentence #220000, processed 53321151 words, keeping 21513016 word types
2019-05-08 22:28:54,921: INFO: PROGRESS: at sentence #230000, processed 55408211 words, keeping 22207241 word types
2019-05-08 22:28:59,645: INFO: PROGRESS: at sentence #240000, processed 57442276 words, keeping 22849499 word types
2019-05-08 22:29:00,988: INFO: PROGRESS: at sentence #250000, processed 59563975 words, keeping 23544817 word types
2019-05-08 22:29:02,292: INFO: PROGRESS: at sentence #260000, processed 61764248 words, keeping 24222911 word types
2019-05-08 22:29:03,654: INFO: PROGRESS: at sentence #270000, processed 63938511 words, keeping 24906453 word types
2019-05-08 22:29:04,900: INFO: PROGRESS: at sentence #280000, processed 66096661 words, keeping 25519781 word types
2019-05-08 22:29:06,057: INFO: PROGRESS: at sentence #290000, processed 67947209 words, keeping 26062482 word types
2019-05-08 22:29:07,229: INFO: PROGRESS: at sentence #300000, processed 69927780 words, keeping 26649878 word types
2019-05-08 22:29:08,506: INFO: PROGRESS: at sentence #310000, processed 71800313 words, keeping 27230264 word types
2019-05-08 22:29:09,836: INFO: PROGRESS: at sentence #320000, processed 73942427 words, keeping 27850568 word types
2019-05-08 22:29:11,419: INFO: PROGRESS: at sentence #330000, processed 75859220 words, keeping 28467061 word types
2019-05-08 22:29:12,379: INFO: collected 28914285 word types from a corpus of 77273203 raw words and 338042 sentences

3. Testing the results


# Test the results
import gensim

# Path to the trained model
basename = "F:/temp/DL/"
model_path = basename + 'wiki.zh.text.model'

model = gensim.models.Word2Vec.load(model_path)

# most_similar() on the model object is deprecated in newer gensim;
# query the word vectors via model.wv instead
result = model.wv.most_similar(u"足球")
for word, similarity in result:
    print(word, similarity)


排球 0.8914323449134827
籃球 0.8889479041099548
棒球 0.854706883430481
高爾夫 0.832783043384552
高爾夫球 0.8316080570220947
網球 0.8276922702789307
橄欖球 0.823620080947876
英式足球 0.8229209184646606
板球 0.822044312953949
欖球 0.8151556253433228

result = model.wv.most_similar(u"男人")
for word, similarity in result:
    print(word, similarity)

女人 0.908246636390686
男孩 0.872255802154541
女孩 0.8567496538162231
孩子 0.8363182544708252
知道 0.8341636061668396
某人 0.8211491107940674
漂亮 0.8023637533187866
伴侶 0.8001378774642944
什麼 0.7944830656051636
嫉妒 0.7929206490516663
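
The training step also wrote the vectors in plain-text word2vec format (wiki.zh.text.vector). As a minimal sketch, assuming the same paths as above, those can be loaded directly with gensim's KeyedVectors, without the full model:

# Loading the plain-text vectors (a sketch; path assumed from the steps above).
# KeyedVectors holds only the word vectors, so it uses less memory than the
# full Word2Vec model and is enough for similarity queries.
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format('F:/temp/DL/wiki.zh.text.vector',
                                       binary=False)
print(wv.most_similar(u"足球"))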

