Wikimedia provides official download links: https://dumps.wikimedia.org/zhwiki/latest/
Chinese wiki dump processed in this article: zhwiki-latest-pages-articles.xml.bz2
English wiki dump processed in this article: enwiki-latest-pages-articles.xml.bz2
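The two file names follow the same pattern, so the download URL can be derived from a language code. Below is a minimal sketch, assuming the English dump sits under the analogous enwiki/latest/ directory. `dump_filename`, `dump_url` and `download_dump` are hypothetical helper names made up for illustration, not part of any library.

```python
import urllib.request

BASE = "https://dumps.wikimedia.org"

def dump_filename(lang):
    """Return the articles dump file name for a language code such as 'zh' or 'en'."""
    return f"{lang}wiki-latest-pages-articles.xml.bz2"

def dump_url(lang):
    """Build the download URL under <lang>wiki/latest/ (assumed layout)."""
    return f"{BASE}/{lang}wiki/latest/{dump_filename(lang)}"

def download_dump(lang, target=None):
    """Download the dump to disk. The files are large, so this is not called here."""
    target = target or dump_filename(lang)
    urllib.request.urlretrieve(dump_url(lang), target)
    return target
```

Calling `download_dump("zh")` would save the Chinese dump into the current directory under its original name.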
1. Data extraction: convert *.xml.bz2 into editable txt
# process_wiki.py
# -*- coding: utf-8 -*-
from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    inp = "enwiki-latest-pages-articles.xml.bz2"
    i = 0
    output_file = "wiki_englist_%07d.txt" % i
    output = open(output_file, 'w', encoding="utf-8")
    # lemmatize=False is accepted by gensim < 4.0; gensim 4.x removed the
    # parameter, so drop it there. dictionary={} skips building a vocabulary.
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        # One article per line, tokens separated by spaces.
        output.write(" ".join(text) + "\n")
        i = i + 1
        # Start a new output file every 10000 articles.
        if i % 10000 == 0:
            output.close()
            output_file = "wiki_englist_%07d.txt" % i
            output = open(output_file, 'w', encoding="utf-8")
            print("Saved " + str(i) + " articles")
    output.close()
    print("Finished saving " + str(i) + " articles")
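The file rotation in the script above can be tried without a dump or gensim. The following sketch isolates it under a few assumptions: `write_shards` is a hypothetical helper name, the input is any iterable of token lists, tokens are joined with spaces, and a file is only opened once there is an article to put in it, so no empty trailing file is left behind.

```python
import os
import tempfile

def write_shards(texts, prefix, shard_size=10000):
    """Write one article per line, starting a new file every shard_size articles.

    texts yields lists of tokens. Returns the paths of the files written.
    """
    paths = []
    out = None
    count = 0
    try:
        for tokens in texts:
            if count % shard_size == 0:
                # Close the previous shard and open the next one.
                if out is not None:
                    out.close()
                path = "%s_%07d.txt" % (prefix, count)
                out = open(path, "w", encoding="utf-8")
                paths.append(path)
            out.write(" ".join(tokens) + "\n")
            count += 1
    finally:
        if out is not None:
            out.close()
    return paths

# Demo with three made-up articles and a shard size of 2.
demo_dir = tempfile.mkdtemp()
demo_paths = write_shards(
    [["a", "b"], ["c"], ["d", "e"]],
    os.path.join(demo_dir, "wiki"),
    shard_size=2,
)
print([os.path.basename(p) for p in demo_paths])
```

With a real dump, `wiki.get_texts()` could be passed as `texts` in place of the made-up list.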
2. Traditional-to-Simplified Chinese conversion
Use the opencc tool: https://code.google.com/archive/p/opencc/downloads
-i: input file
-o: output file
-c: configuration file; zht2zhs.ini converts Traditional Chinese to Simplified Chinese
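The three options above combine into a single opencc call. Here is a minimal sketch that builds and runs that call from Python, assuming an `opencc` executable is on the PATH and that the installed version still ships zht2zhs.ini (newer OpenCC releases use differently named configuration files, so check your installation). `opencc_command` and `convert_to_simplified` are hypothetical helper names.

```python
import subprocess

def opencc_command(input_file, output_file, config="zht2zhs.ini"):
    """Build the opencc argument list from the -i, -o and -c options."""
    return ["opencc", "-i", input_file, "-o", output_file, "-c", config]

def convert_to_simplified(input_file, output_file, config="zht2zhs.ini"):
    """Run opencc; raises CalledProcessError if the tool exits with an error."""
    subprocess.run(opencc_command(input_file, output_file, config), check=True)
```

Keeping the command construction in its own function makes it easy to print and inspect the exact command line before running it on a large file.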
3. Character encoding conversion
iconv -c -t UTF-8 < input_file > output_file
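In the iconv command, -c drops characters that cannot be converted and -t UTF-8 sets the output encoding. Without -f, iconv takes the input encoding from the current locale. The sketch below is a rough Python counterpart, not a replacement for iconv: it makes the source encoding explicit and assumes UTF-8 by default. `drop_invalid_bytes` is a hypothetical helper name.

```python
import os
import tempfile

def drop_invalid_bytes(input_file, output_file, source_encoding="utf-8"):
    """Re-encode a file as UTF-8, silently skipping bytes that cannot be decoded."""
    with open(input_file, "rb") as src, open(output_file, "wb") as dst:
        # Splitting on b"\n" is safe for ASCII-compatible encodings such as UTF-8.
        for raw in src:
            text = raw.decode(source_encoding, errors="ignore")
            dst.write(text.encode("utf-8"))

# Demo: two stray bytes (0xff, 0xfe) sit in front of a valid character.
work_dir = tempfile.mkdtemp()
dirty = os.path.join(work_dir, "dirty.txt")
clean = os.path.join(work_dir, "clean.txt")
with open(dirty, "wb") as f:
    f.write(b"ok \xff\xfe\xe4\xb8\xad\n")
drop_invalid_bytes(dirty, clean)
with open(clean, encoding="utf-8") as f:
    cleaned_text = f.read()
```

After the demo runs, `cleaned_text` holds the line with the two stray bytes removed and the valid character kept.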
4. Word segmentation
https://github.com/fxsjy/jieba
pip install jieba
python -m jieba input_file > cut_file
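The same segmentation can be done through the Python API, which gives control over the output delimiter (the command-line tool separates words with " / " unless the -d option is given). This is a minimal sketch, assuming space-separated output is wanted. `segment_file` is a hypothetical helper name, and the tokenizer is a parameter so the function can be tried without jieba installed.

```python
import os
import tempfile

def segment_file(input_file, output_file, cut=None):
    """Segment each line and write the words separated by single spaces."""
    if cut is None:
        import jieba  # imported lazily; only needed when no tokenizer is passed
        cut = jieba.cut
    with open(input_file, encoding="utf-8") as src, \
         open(output_file, "w", encoding="utf-8") as dst:
        for line in src:
            # Skip whitespace-only tokens so the output has no double spaces.
            words = [w for w in cut(line.strip()) if w.strip()]
            dst.write(" ".join(words) + "\n")

# Demo with a whitespace tokenizer as a stand-in, so it runs without jieba.
seg_dir = tempfile.mkdtemp()
seg_in = os.path.join(seg_dir, "input.txt")
seg_out = os.path.join(seg_dir, "cut.txt")
with open(seg_in, "w", encoding="utf-8") as f:
    f.write("hello   world\n")
segment_file(seg_in, seg_out, cut=str.split)
with open(seg_out, encoding="utf-8") as f:
    segmented_text = f.read()
```

With jieba installed, `segment_file("input_file", "cut_file")` would use `jieba.cut` on each line.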
Alternatively, use FoolNLTK
https://github.com/rockyzhengwu/FoolNLTK
pip install foolnltk
Or jieba_fast
https://github.com/deepcs233/jieba_fast
pip install jieba_fast