Intro
The Mahout Examples source comes with tools for classifying a Wikipedia data dump using either the Naive Bayes or Complementary Naive Bayes implementation in Mahout. The example described below takes a Wikipedia dump, splits it into chunks, and then splits those chunks by country. From these splits, a classifier is trained to predict which country an unseen article should be assigned to.
Running the example
- Download the Wikipedia data set (enwiki-latest-pages-articles.xml.bz2) from the Wikimedia dumps site at https://dumps.wikimedia.org/enwiki/latest/
- Decompress the bz2 file to get enwiki-latest-pages-articles.xml.
- Create the directory $MAHOUT_HOME/examples/temp and copy the XML file into it
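For example, assuming wget and bzip2 are available locally and using the standard Wikimedia dumps URL, these first steps might look like this (the dump is several GB, so allow time and disk space):
# Download the latest English Wikipedia articles dump.
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
# Decompress it, producing enwiki-latest-pages-articles.xml.
bzip2 -d enwiki-latest-pages-articles.xml.bz2
# Create the working directory and copy the XML into it.
mkdir -p $MAHOUT_HOME/examples/temp
cp enwiki-latest-pages-articles.xml $MAHOUT_HOME/examples/temp/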
- Chunk the data into pieces (the -c option sets the approximate chunk size in megabytes):
$MAHOUT_HOME/bin/mahout wikipediaXMLSplitter -d $MAHOUT_HOME/examples/temp/enwiki-latest-pages-articles.xml -o wikipedia/chunks -c 64
We strongly suggest backing up the results somewhere else, so that you don't have to repeat this step if they are accidentally erased.
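One way to do that, for example, is to copy the chunk directory out of HDFS onto local disk (the destination path below is only an example):
# Copy the generated chunks from HDFS to a local backup directory.
hadoop fs -copyToLocal wikipedia/chunks /backup/wikipedia-chunks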
- This creates the chunks in HDFS. Verify this by running
hadoop fs -ls wikipedia/chunks
which lists the chunk files as chunk-0001.xml and so on.
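To spot-check that a chunk contains well-formed article XML, you can print the first few lines of one of them (the chunk name is taken from the listing above):
# Look at the beginning of the first chunk.
hadoop fs -cat wikipedia/chunks/chunk-0001.xml | head -20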
- Create the country-based split of the Wikipedia data set:
$MAHOUT_HOME/bin/mahout wikipediaDataSetCreator -i wikipedia/chunks -o wikipediainput -c $MAHOUT_HOME/examples/src/test/resources/country.txt
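The -c option points at a plain-text file of category labels (here, country names), one per line. For a smaller, quicker experiment you could supply your own file; the file name and labels below are only an illustration:
# Hypothetical smaller categories file: one country name per line.
printf 'Germany\nFrance\nJapan\n' > my-countries.txt
$MAHOUT_HOME/bin/mahout wikipediaDataSetCreator -i wikipedia/chunks -o wikipediainput -c my-countries.txt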
- Verify the creation of the input data set by running
hadoop fs -ls wikipediainput
and you should see a part-r-00000 file inside the wikipediainput directory
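To sanity-check the contents (each record should pair a country label with article text), you can peek at the beginning of the file; records can be very long, so only take a few lines:
# Print the start of the generated data set.
hadoop fs -cat wikipediainput/part-r-00000 | head -5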
- Train the classifier:
$MAHOUT_HOME/bin/mahout trainclassifier -i wikipediainput -o wikipediamodel
The model file will be available in the wikipediamodel folder in HDFS.
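As with the earlier steps, you can confirm that the model was written by listing the output directory:
hadoop fs -ls wikipediamodel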
- Test the classifier:
$MAHOUT_HOME/bin/mahout testclassifier -m wikipediamodel -d wikipediainput
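To use the Complementary Naive Bayes implementation instead, both trainclassifier and testclassifier take a classifier-type option; the exact flag name depends on your Mahout version, so check the usage output, for example:
# Print the available options; if --help is not recognized, running the command
# with no arguments also prints the usage.
$MAHOUT_HOME/bin/mahout trainclassifier --help
$MAHOUT_HOME/bin/mahout testclassifier --help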