1.1 Using the Bayes Classifier in Mahout 0.8
Part 3, Section 1 of the document "JD Big Data Computing Platform - Mahout 0.6 Application Development" introduced text classification with the Bayes classifier, i.e. the Twenty Newsgroups example. There is in fact another way to run it: the 0.6 installation directory contains a script, ./examples/bin/classify-20newsgroups.sh, whose logic is the same as in that document; it simply wraps the individual API calls into a single shell script. Starting with version 0.7, Mahout removed the command-line APIs prepare20newsgroups, trainclassifier, and testclassifier, so the example can only be run through the shell script. The following sections describe how to use the shell script and walk through its internal logic.
1.1.1 Environment Configuration
According to the instructions at https://cwiki.apache.org/confluence/display/MAHOUT/Twenty+Newsgroups, the following environment variables must be set:
JAVA_HOME: the Java installation path
HADOOP_HOME: the Hadoop installation path
MAHOUT_HOME: the Mahout installation path
Also add the Hadoop and Mahout bin directories to the PATH. On Ubuntu (Release 12.04 (precise), 64-bit), run gedit .profile in the current user's home directory and add the following to the opened .profile file:
export JAVA_HOME=/java/home
export HADOOP_HOME=/hadoop/home
export MAHOUT_HOME=/mahout/home
export PATH=$PATH:$JAVA_HOME/bin:$MAHOUT_HOME/bin:$HADOOP_HOME/bin
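After editing .profile and re-sourcing it, it is easy to mistype one of the variables. The helper below is a small sketch (check_env is a hypothetical name, not part of Mahout) that reports whether the three required variables are set:

```shell
# Hypothetical sanity-check helper: verifies that the three environment
# variables required by the 20newsgroups example are set and non-empty.
check_env() {
  for v in JAVA_HOME HADOOP_HOME MAHOUT_HOME; do
    eval "val=\$$v"              # indirect lookup of the variable named in $v
    if [ -z "$val" ]; then
      echo "$v is NOT set"
      return 1
    fi
  done
  echo "environment looks OK"
}
```

Run `source ~/.profile && check_env` to confirm the configuration before invoking the example script.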
Since the prebuilt Mahout 0.8 release (mahout-distribution-0.8.tar.gz) was downloaded directly from http://mirrors.hust.edu.cn/apache/mahout/0.8/, no build step is required.
MAHOUT_LOCAL and HADOOP_CONF_DIR
These two variables determine whether Mahout runs locally or on Hadoop.
The $MAHOUT_HOME/bin/mahout script shows that as long as MAHOUT_LOCAL is set to any non-empty string, Mahout runs in local mode, regardless of whether the user has set HADOOP_CONF_DIR or HADOOP_HOME; in other words, for Mahout to run on Hadoop, MAHOUT_LOCAL must be empty.
HADOOP_CONF_DIR specifies the Hadoop configuration Mahout uses when running in Hadoop mode; this directory normally points to the conf directory under $HADOOP_HOME.
Therefore, to run Mahout 0.8 locally, you must also set MAHOUT_LOCAL in the .profile file:
export MAHOUT_LOCAL=true
If this variable is not set, the example will fail to run or report errors.
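The decision logic can be summarized as follows. This is a simplified sketch of the behavior described above, not the literal code in $MAHOUT_HOME/bin/mahout, and mahout_mode is a hypothetical function name used only for illustration:

```shell
# Sketch of how bin/mahout chooses its execution mode:
mahout_mode() {
  if [ -n "$MAHOUT_LOCAL" ]; then
    echo "local"            # any non-empty MAHOUT_LOCAL forces local mode
  elif [ -n "$HADOOP_CONF_DIR" ] || [ -n "$HADOOP_HOME" ]; then
    echo "hadoop"           # Hadoop configuration available: run on Hadoop
  else
    echo "local"            # nothing configured: fall back to local mode
  fi
}
```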
1.1.2 Running the Twenty Newsgroups Example via the Shell Script
cd $MAHOUT_HOME
./examples/bin/classify-20newsgroups.sh
If a network problem during script execution prevents 20news-bydate.tar.gz from being downloaded, blocking the remaining steps, you can download the file manually into the working directory (/tmp/mahout-work-${USER}, where USER is the current login user name) and then re-run the shell script.
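The manual workaround can be sketched as the commands below; the paths match the script's defaults, so adjust WORK_DIR if your setup differs:

```shell
# Fetch the 20newsgroups archive into the script's working directory,
# then re-run classify-20newsgroups.sh, which will skip the download.
WORK_DIR=/tmp/mahout-work-${USER}
mkdir -p ${WORK_DIR}
curl http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz \
     -o ${WORK_DIR}/20news-bydate.tar.gz \
  || echo "download failed; copy the archive into ${WORK_DIR} by other means"
# afterwards: cd $MAHOUT_HOME && ./examples/bin/classify-20newsgroups.sh
```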
1.1.3 Walkthrough of the Shell Script
#!/bin/bash
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
#
# Downloads the 20newsgroups dataset, trains and tests a classifier.
#
# To run: change into the mahout directory and type:
#   examples/bin/classify-20newsgroups.sh
if [ "$1" = "--help" ] || [ "$1" = "--?" ]; then
  echo "This script runs SGD and Bayes classifiers over the classic 20 News Groups."
  exit
fi
SCRIPT_PATH=${0%/*}
if [ "$0" != "$SCRIPT_PATH" ] && [ "$SCRIPT_PATH" != "" ]; then
  cd $SCRIPT_PATH
fi
START_PATH=`pwd`
WORK_DIR=/tmp/mahout-work-${USER}
algorithm=(cnaivebayes naivebayes sgd clean)
if [ -n "$1" ]; then
  choice=$1
else
  echo "Please select a number to choose the corresponding task to run"
  echo "1. ${algorithm[0]}"
  echo "2. ${algorithm[1]}"
  echo "3. ${algorithm[2]}"
  echo "4. ${algorithm[3]} -- cleans up the work area in $WORK_DIR"
  read -p "Enter your choice : " choice
fi
echo "ok. You chose $choice and we'll use ${algorithm[$choice-1]}"
alg=${algorithm[$choice-1]}
echo "creating work directory at ${WORK_DIR}"
mkdir -p ${WORK_DIR}
if [ ! -e ${WORK_DIR}/20news-bayesinput ]; then
  if [ ! -e ${WORK_DIR}/20news-bydate ]; then
    if [ ! -f ${WORK_DIR}/20news-bydate.tar.gz ]; then
      echo "Downloading 20news-bydate"
      curl http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz -o ${WORK_DIR}/20news-bydate.tar.gz
    fi
    mkdir -p ${WORK_DIR}/20news-bydate
    echo "Extracting..."
    cd ${WORK_DIR}/20news-bydate && tar xzf ../20news-bydate.tar.gz && cd .. && cd ..
  fi
fi
#echo $START_PATH
cd $START_PATH
cd ../..
set -e
if [ "x$alg" == "xnaivebayes" -o "x$alg" == "xcnaivebayes" ]; then
  c=""
  if [ "x$alg" == "xcnaivebayes" ]; then
    c=" -c"
  fi
  set -x
  echo "Preparing 20newsgroups data"
  rm -rf ${WORK_DIR}/20news-all
  mkdir ${WORK_DIR}/20news-all
  cp -R ${WORK_DIR}/20news-bydate/*/* ${WORK_DIR}/20news-all
  # Convert the 20newsgroups data into sequence files
echo "Creating sequence files from20newsgroups data"
./bin/mahout seqdirectory \
-i ${WORK_DIR}/20news-all \
-o ${WORK_DIR}/20news-seq -ow
  # Convert the sequence files into vectors
echo "Converting sequence files tovectors"
./bin/mahout seq2sparse \
-i ${WORK_DIR}/20news-seq \
-o ${WORK_DIR}/20news-vectors -lnorm -nv -wt tfidf
  # Randomly split the vector data 80-20 into a training set and a test set
echo "Creating training and holdout setwith a random 80-20 split of the generated vector dataset"
./bin/mahout split \
    -i ${WORK_DIR}/20news-vectors/tfidf-vectors \
    --trainingOutput ${WORK_DIR}/20news-train-vectors \
    --testOutput ${WORK_DIR}/20news-test-vectors \
    --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
  # Train the Naive Bayes model
echo "Training Naive Bayes model"
./bin/mahout trainnb \
-i ${WORK_DIR}/20news-train-vectors -el \
-o ${WORK_DIR}/model \
-li ${WORK_DIR}/labelindex \
-ow $c
  # Test against the training set itself; the resulting error is the training error
echo "Self testing on training set"
./bin/mahout testnb \
    -i ${WORK_DIR}/20news-train-vectors \
-m ${WORK_DIR}/model \
-l ${WORK_DIR}/labelindex \
-ow -o ${WORK_DIR}/20news-testing $c
  # Test against the holdout set; the resulting error is the test error
echo "Testing on holdout set"
./bin/mahout testnb \
    -i ${WORK_DIR}/20news-test-vectors \
-m ${WORK_DIR}/model \
-l ${WORK_DIR}/labelindex \
-ow -o ${WORK_DIR}/20news-testing $c
elif [ "x$alg" == "xsgd" ]; then
  if [ ! -e "/tmp/news-group.model" ]; then
    echo "Training on ${WORK_DIR}/20news-bydate/20news-bydate-train/"
    ./bin/mahout org.apache.mahout.classifier.sgd.TrainNewsGroups ${WORK_DIR}/20news-bydate/20news-bydate-train/
  fi
  echo "Testing on ${WORK_DIR}/20news-bydate/20news-bydate-test/ with model: /tmp/news-group.model"
  ./bin/mahout org.apache.mahout.classifier.sgd.TestNewsGroups --input ${WORK_DIR}/20news-bydate/20news-bydate-test/ --model /tmp/news-group.model
elif ["x$alg" == "xclean" ]; then
rm -rf ${WORK_DIR}
rm -rf /tmp/news-group.model
fi
# Remove the work directory
#
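One detail worth noting in the script above is how the menu choice is mapped to an algorithm name: bash evaluates the arithmetic expression $choice-1 inside the array subscript, so choice 1 selects index 0, choice 2 selects index 1, and so on. A minimal standalone illustration:

```shell
# The script's choice-to-algorithm mapping, extracted for illustration:
# bash evaluates $choice-1 arithmetically inside the array subscript.
algorithm=(cnaivebayes naivebayes sgd clean)
choice=2
alg=${algorithm[$choice-1]}   # index 1, i.e. "naivebayes"
echo $alg
```

This is also why the script can be run non-interactively by passing the choice number as the first argument, e.g. `./examples/bin/classify-20newsgroups.sh 2` for naivebayes.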