Intro
The Mahout Examples source comes with tools for classifying a Wikipedia data dump using either the Naive Bayes or Complementary Naive Bayes implementation in Mahout. The example described below takes a Wikipedia dump, splits it into chunks, and then splits those chunks by country. From these splits, a classifier is trained to predict which country an unseen article should be assigned to.
Running the example
- Download the Wikipedia data set (enwiki-latest-pages-articles.xml.bz2) from the Wikimedia dumps site at https://dumps.wikimedia.org/enwiki/latest/
- Decompress the bz2 file to get enwiki-latest-pages-articles.xml.
- Create the directory $MAHOUT_HOME/examples/temp and copy the XML file into it
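For example, assuming wget and bzip2 are available locally and using the standard Wikimedia dumps URL, these first steps might look like this (the dump is several GB, so allow time and disk space):
# Download the latest English Wikipedia articles dump.
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
# Decompress it, producing enwiki-latest-pages-articles.xml.
bzip2 -d enwiki-latest-pages-articles.xml.bz2
# Create the working directory and copy the XML into it.
mkdir -p $MAHOUT_HOME/examples/temp
cp enwiki-latest-pages-articles.xml $MAHOUT_HOME/examples/temp/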
- Chunk the data into pieces (the -c option sets the approximate chunk size in megabytes):
$MAHOUT_HOME/bin/mahout wikipediaXMLSplitter -d $MAHOUT_HOME/examples/temp/enwiki-latest-pages-articles.xml -o wikipedia/chunks -c 64
We strongly suggest backing up the results somewhere else, so that you don't have to repeat this step if they are accidentally erased.
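One way to do that, for example, is to copy the chunk directory out of HDFS onto local disk (the destination path below is only an example):
# Copy the generated chunks from HDFS to a local backup directory.
hadoop fs -copyToLocal wikipedia/chunks /backup/wikipedia-chunks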
- This creates the chunks in HDFS. Verify this by running
hadoop fs -ls wikipedia/chunks
which lists the chunk files as chunk-0001.xml and so on.
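To spot-check that a chunk contains well-formed article XML, you can print the first few lines of one of them (the chunk name is taken from the listing above):
# Look at the beginning of the first chunk.
hadoop fs -cat wikipedia/chunks/chunk-0001.xml | head -20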
- Create the country-based split of the Wikipedia data set:
$MAHOUT_HOME/bin/mahout wikipediaDataSetCreator -i wikipedia/chunks -o wikipediainput -c $MAHOUT_HOME/examples/src/test/resources/country.txt
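The -c option points at a plain-text file of category labels (here, country names), one per line. For a smaller, quicker experiment you could supply your own file; the file name and labels below are only an illustration:
# Hypothetical smaller categories file: one country name per line.
printf 'Germany\nFrance\nJapan\n' > my-countries.txt
$MAHOUT_HOME/bin/mahout wikipediaDataSetCreator -i wikipedia/chunks -o wikipediainput -c my-countries.txt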
- Verify the creation of the input data set by running
hadoop fs -ls wikipediainput
and you should see a part-r-00000 file inside the wikipediainput directory
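To sanity-check the contents (each record should pair a country label with article text), you can peek at the beginning of the file; records can be very long, so only take a few lines:
# Print the start of the generated data set.
hadoop fs -cat wikipediainput/part-r-00000 | head -5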
- Train the classifier:
$MAHOUT_HOME/bin/mahout trainclassifier -i wikipediainput -o wikipediamodel
The model file will be available in the wikipediamodel folder in HDFS.
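As with the earlier steps, you can confirm that the model was written by listing the output directory:
hadoop fs -ls wikipediamodel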
- Test the classifier:
$MAHOUT_HOME/bin/mahout testclassifier -m wikipediamodel -d wikipediainput
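To use the Complementary Naive Bayes implementation instead, both trainclassifier and testclassifier take a classifier-type option; the exact flag name depends on your Mahout version, so check the usage output, for example:
# Print the available options; if --help is not recognized, running the command
# with no arguments also prints the usage.
$MAHOUT_HOME/bin/mahout trainclassifier --help
$MAHOUT_HOME/bin/mahout testclassifier --help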