Corpus Collection

Latest recommended article published on 2021-12-22 22:54:21

qq_41212157


Article link: https://blog.csdn.net/qq_41212157/article/details/102581223


Prerequisite:

Install OpenCC on Ubuntu with:

sudo apt-get install opencc

Then run with:

opencc <options>

Download the data from the Wikimedia dumps, e.g. the latest Chinese Wikipedia dump:
https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2
or pick a dated dump from the index at:
https://dumps.wikimedia.org/zhwiki/
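The download step can also be scripted. The following is a minimal sketch, assuming the URL pattern shown above also holds for other language codes; `dump_url` and `download` are hypothetical helper names, not part of any Wikimedia tooling.

```python
import shutil
import urllib.request


def dump_url(lang: str = "zh") -> str:
    """Build the 'latest' pages-articles dump URL for a Wikipedia language code.

    Assumption: other languages follow the same pattern as the zhwiki URL.
    """
    wiki = f"{lang}wiki"
    return (
        f"https://dumps.wikimedia.org/{wiki}/latest/"
        f"{wiki}-latest-pages-articles.xml.bz2"
    )


def download(url: str, dest: str) -> None:
    """Stream the dump to disk in chunks instead of loading it into memory."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)


# Example usage (downloads a very large file, so it is left commented out):
# url = dump_url("zh")
# download(url, url.rsplit("/", 1)[-1])
```

Streaming with `shutil.copyfileobj` matters here because the compressed dump is far too large to hold in memory comfortably.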
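Use WikiExtractor to extract the title and content of each article from the XML dump:
python WikiExtractor.py -b 500M -o extracted --json zhwiki-latest-pages-articles.xml.bz2
Note: -b 500M caps each output file at 500 MB and -o extracted sets the output directory. The --json flag writes each article as a JSON object, one per line, instead of the default HTML-like <doc> text.
ref: https://github.com/attardi/wikiextractor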
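(Optional) Use OpenCC to convert Traditional Chinese to Simplified Chinese:
opencc -i wiki_00 -o zh_wiki_c -c zht2zhs.ini
Note: the configuration file is passed with -c. zht2zhs.ini is the name used by older OpenCC releases; OpenCC 1.0 and later ship t2s.json instead.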
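WikiExtractor usually splits its output across several files (extracted/AA/wiki_00, extracted/AA/wiki_01, ...), so the conversion has to be repeated per file. The sketch below wraps the opencc command line in a loop. It assumes the opencc binary is on PATH and that the configuration is passed with -c; the default config name is taken from the command above and may need to be t2s.json on newer OpenCC releases. `opencc_command` and `convert_all` are hypothetical helper names.

```python
import subprocess
from pathlib import Path
from typing import List


def opencc_command(src: Path, dst: Path, config: str = "zht2zhs.ini") -> List[str]:
    """Build the opencc argument list for one input file.

    Assumption: the config name depends on the installed OpenCC version
    (for example "t2s.json" on OpenCC 1.0 and later).
    """
    return ["opencc", "-i", str(src), "-o", str(dst), "-c", config]


def convert_all(in_dir: Path, out_dir: Path, config: str = "zht2zhs.ini") -> List[Path]:
    """Convert every wiki_* file under in_dir and return the written paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(in_dir.rglob("wiki_*")):
        if not src.is_file():
            continue
        # extracted/AA/wiki_00 -> out_dir/AA_wiki_00 (keeps names unique)
        dst = out_dir / f"{src.parent.name}_{src.name}"
        subprocess.run(opencc_command(src, dst, config), check=True)
        written.append(dst)
    return written


# Example usage (requires opencc to be installed):
# convert_all(Path("extracted"), Path("simplified"))
```

Passing check=True makes the loop stop at the first failed conversion rather than silently leaving gaps in the corpus.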
Run convert_symbols.py to extract the text only.
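convert_symbols.py itself is not shown in this post, so the following is not the author's script. It is only a minimal sketch of one way to pull the plain text out of the --json output, assuming each line is a JSON object with a "text" field, as WikiExtractor writes it. `iter_texts` and `write_plain_text` are hypothetical names.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_texts(path: Path) -> Iterator[str]:
    """Yield the non-empty "text" field of each JSON line in a wiki_* file."""
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            # Assumption: article bodies live under the "text" key.
            text = record.get("text", "").strip()
            if text:
                yield text


def write_plain_text(src: Path, dst: Path) -> int:
    """Write one article body after another to dst; return the article count."""
    count = 0
    with dst.open("w", encoding="utf-8") as out:
        for text in iter_texts(src):
            out.write(text + "\n")
            count += 1
    return count


# Example usage:
# write_plain_text(Path("zh_wiki_c"), Path("zh_wiki_text.txt"))
```

Any further symbol or punctuation cleanup would be applied to each text string before it is written out.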
