Prerequisite:
Install OpenCC on Ubuntu with:
sudo apt-get install opencc
Then run with:
opencc <options>
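For scripted use, the same command can be assembled from Python. This is a minimal sketch, assuming the opencc binary is on PATH and takes -i (input file), -o (output file), and -c (config file). The helper names build_opencc_cmd and run_opencc are hypothetical, and the config file name depends on the installed OpenCC version.

```python
import shutil
import subprocess


def build_opencc_cmd(src, dst, config):
    """Return the opencc argument list for converting src into dst."""
    # Assumption: -i, -o and -c are the input, output and config flags.
    return ["opencc", "-i", src, "-o", dst, "-c", config]


def run_opencc(src, dst, config):
    """Run opencc on one file; raise if the binary is missing or fails."""
    if shutil.which("opencc") is None:
        raise RuntimeError("opencc not found on PATH; install it first")
    subprocess.run(build_opencc_cmd(src, dst, config), check=True)
```

Example call: run_opencc("wiki_00", "zh_wiki_c", "zht2zhs.ini")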
How to use:
- Download data from the Wikimedia dumps, e.g.
https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2
or pick a dump from
https://dumps.wikimedia.org/zhwiki/
- Use WikiExtractor to extract the title and contents from the XML:
python WikiExtractor.py -b 500M -o extracted --json zhwiki-latest-pages-articles.xml.bz2
Note: the --json flag exports the output as JSON instead of HTML-like text.
ref: https://github.com/attardi/wikiextractor
- (Optional) Use OpenCC to convert Traditional Chinese to Simplified Chinese:
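opencc -i wiki_00 -o zh_wiki_c -c zht2zhs.ini
Note: the config file is passed with -c. OpenCC 1.0 and later ship JSON configs, so use t2s.json there instead of zht2zhs.ini.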
- Run convert_symbols.py to extract the text only.
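convert_symbols.py itself is not reproduced here. As a rough illustration of the text-only step, the sketch below reads one WikiExtractor --json output file and writes just the article text. It assumes each line of the file is one JSON object with "title" and "text" fields. The function names iter_articles and write_plain_text are hypothetical and are not taken from convert_symbols.py.

```python
import json


def iter_articles(path):
    """Yield (title, text) pairs from one WikiExtractor --json output file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            doc = json.loads(line)
            # Assumption: each record carries "title" and "text" fields.
            yield doc.get("title", ""), doc.get("text", "")


def write_plain_text(src, dst):
    """Write the article text from src to dst and return the article count."""
    count = 0
    with open(dst, "w", encoding="utf-8") as out:
        for _title, text in iter_articles(src):
            text = text.strip()
            if text:
                # One blank line separates consecutive articles.
                out.write(text + "\n\n")
                count += 1
    return count
```

Example call, assuming the converted file from the previous step: write_plain_text("zh_wiki_c", "zh_wiki_text.txt")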
Reference:
https://blog.csdn.net/u013421941/article/details/68947622