Main reference:
https://www.cnblogs.com/chenbjin/p/5635853.html
When parsing zhwiki, do NOT decompress the dump. Pass the .bz2 file to the extraction script as-is.
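To illustrate why decompression is unnecessary: Python's standard bz2 module can stream a compressed file line by line, and gensim's WikiCorpus also opens the .bz2 dump directly. The sketch below is only an illustration, not part of the pipeline; the helper name count_pages is hypothetical, and it simply counts <page> opening tags without ever writing the decompressed XML to disk.

```python
import bz2


def count_pages(dump_path):
    """Count <page> opening tags in a bz2-compressed XML dump, streaming."""
    pages = 0
    # bz2.open in text mode decompresses lazily as the file is iterated,
    # so the full XML is never held in memory or written to disk.
    with bz2.open(dump_path, "rt", encoding="utf-8") as dump:
        for line in dump:
            if "<page>" in line:
                pages += 1
    return pages
```

Usage would look like count_pages("zhwiki-20181001-pages-articles.xml.bz2"); on a full dump this takes a while because the whole file is scanned.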
1. Data source
https://dumps.wikimedia.org/zhwiki/20181001/
zhwiki-20181001-pages-articles.xml.bz2
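The directory URL and the file name above follow a regular pattern, so the full download URL can be assembled from the wiki name and the dump date. This is a small sketch: dump_url is a hypothetical helper, and the assumption that the file sits directly under the dated directory is inferred from the two lines above, not from a documented API. Older dumps are eventually removed from the server, so the 20181001 URL may no longer resolve.

```python
def dump_url(wiki="zhwiki", date="20181001"):
    """Build the assumed download URL for a pages-articles dump."""
    base = "https://dumps.wikimedia.org"
    filename = "{0}-{1}-pages-articles.xml.bz2".format(wiki, date)
    # Assumed layout: <base>/<wiki>/<date>/<filename>
    return "{0}/{1}/{2}/{3}".format(base, wiki, date, filename)


print(dump_url())
```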
2. Extract text with Gensim
2.1 Install Gensim on macOS, see: https://radimrehurek.com/gensim/install.html
pip install --upgrade gensim
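If this fails with a permission error, or the library cannot be found afterwards, add --user. Do the same when installing the packages Gensim depends on: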
pip install --upgrade --user gensim
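After installing, it can help to confirm that the interpreter you will run the script with actually sees the package. A minimal sketch, assuming Python 3.8 or later for importlib.metadata; installed_version is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the gensim version, or None if this interpreter cannot find it.
print(installed_version("gensim"))
```

If this prints None right after a successful pip install, pip and python most likely belong to different interpreters.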
2.2 Extract:
python preprocess_wiki_1.py zhwiki-20181001-pages-articles.xml.bz2 zhwiki-20181001.txt
# -*- coding: utf-8 -*-
"""Extract plain text from a wiki dump (*articles.xml.bz2) with gensim."""
import logging
import sys

from gensim.corpora import WikiCorpus

logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s', level=logging.INFO)


def usage():
    print("Usage: python preprocess_wiki_1.py zhwiki-20181001-pages-articles.xml.bz2 zhwiki-20181001.txt")


if __name__ == '__main__':
    if len(sys.argv) < 3:
        usage()
        sys.exit(1)
    logging.info("running %s", ' '.join(sys.argv))
    inp, outp = sys.argv[1:3]
    i = 0
    # dictionary={} skips the vocabulary-building pass over the whole dump.
    # lemmatize=False is a gensim 3.x argument; gensim 4.x removed it, so drop it there.
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    # Write UTF-8 explicitly so the output does not depend on the locale.
    with open(outp, 'w', encoding='utf-8') as output:
        for text in wiki.get_texts():
            # One article per line, tokens separated by single spaces.
            output.write(" ".join(text) + "\n")
            i += 1
            if i % 10000 == 0:
                logging.info("Saved %d articles", i)
    logging.info("Finished: saved %d articles", i)
3. Convert Traditional Chinese to Simplified Chinese
Install OpenCC, see: https://blog.csdn.net/zhyongwei/article/details/79592162
brew install OpenCC
For installing and uninstalling Homebrew, see: https://blog.csdn.net/sir_coding/article/details/77509602
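The conversion itself is a single opencc invocation. The sketch below wraps that command line in Python with subprocess, which can be handy if the rest of the pipeline is driven from Python. It assumes the opencc binary installed by Homebrew is on PATH; the helper names opencc_command and to_simplified are hypothetical, and the flags (-i, -o, -c t2s.json) are the ones used in the shell script in section 5.

```python
import subprocess


def opencc_command(src, dst, config="t2s.json"):
    """Build the opencc argument list for Traditional -> Simplified."""
    return ["opencc", "-i", src, "-o", dst, "-c", config]


def to_simplified(src, dst):
    """Run opencc on src and write the converted text to dst."""
    # check=True raises CalledProcessError if opencc exits non-zero.
    subprocess.run(opencc_command(src, dst), check=True)
```

For this dataset the call would be to_simplified("zhwiki-20181001.txt", "zhwiki-20181001.chs.txt").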
4. Word segmentation
Use the jieba tokenizer.
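Segmentation can also be done from Python instead of the python -m jieba command line. A minimal sketch, assuming jieba is installed; segment_line, segment_file and the cut parameter are hypothetical names, and the parameter exists only so the tokenizer can be swapped out. jieba.cut returns whitespace as tokens too, so those are filtered to avoid doubled spaces in the output.

```python
def segment_line(line, cut=None):
    """Return the line with its tokens separated by single spaces."""
    if cut is None:
        import jieba  # imported lazily so the helper loads without jieba
        cut = jieba.cut
    # Drop whitespace-only tokens so the output has no doubled spaces.
    return " ".join(tok for tok in cut(line.strip()) if tok.strip())


def segment_file(src, dst, cut=None):
    """Segment a UTF-8 text file line by line into dst."""
    with open(src, encoding="utf-8") as fin, \
            open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(segment_line(line, cut) + "\n")
```

For this dataset the call would be segment_file("zhwiki-20181001.chs.txt", "zhwiki-20181001.chs.seg.txt").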
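5. Convert to UTF-8 with iconv
The script below runs steps 3 to 5 in order: OpenCC conversion, jieba segmentation, then iconv.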
#!/bin/bash
# preprocess data
# Traditional Chinese to Simplified Chinese
echo "opencc: Traditional Chinese to Simplified Chinese..."
# OpenCC 1.x uses the t2s.json config; older 0.x releases used zht2zhs.ini instead.
time opencc -i zhwiki-20181001.txt -o zhwiki-20181001.chs.txt -c t2s.json
# Word segmentation (tokens separated by a single space)
echo "jieba: segmenting words..."
time python -m jieba -d ' ' zhwiki-20181001.chs.txt > zhwiki-20181001.chs.seg.txt
# Normalize the encoding: -c drops characters that cannot be converted to UTF-8
echo "iconv: converting to UTF-8..."
time iconv -c -t UTF-8 < zhwiki-20181001.chs.seg.txt > zhwiki-20181001.chs.seg.utf.txt
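The iconv -c step silently drops byte sequences that cannot be converted. A rough Python equivalent, assuming the input is already meant to be UTF-8, decodes with errors="ignore" and writes the result back out. strip_invalid_utf8 is a hypothetical helper name, and this is a sketch of the idea, not a byte-for-byte reimplementation of iconv.

```python
def strip_invalid_utf8(src, dst):
    """Copy src to dst, dropping byte sequences that are not valid UTF-8."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        # Splitting on newlines is safe: no multi-byte UTF-8 sequence
        # contains the byte 0x0A.
        for raw in fin:
            text = raw.decode("utf-8", errors="ignore")
            fout.write(text.encode("utf-8"))
```

For this dataset the call would be strip_invalid_utf8("zhwiki-20181001.chs.seg.txt", "zhwiki-20181001.chs.seg.utf.txt").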