Pseudo-Distributed Hadoop + Mahout Deployment and a Test of the Classic 20 Newsgroups Algorithm

--------------------------------------------------------------------------
Phase 1: Pseudo-distributed Hadoop installation

Phase 2: Mahout installation

Phase 3: Bayes algorithm test on 20 Newsgroups
-------------------------------------------------------------------------
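Note: after installing VMware Tools, CentOS must be rebooted for it to take effect.
Phase 1: Pseudo-distributed Hadoop installation
1. Installing the JDK
1.1 Unpack the Hadoop package and remove the OpenJDK bundled with the system
1. Check the current Java version: # java -version
List the installed Java packages: # rpm -qa | grep java
Remove a bundled package: # rpm -e --nodeps <package-name>
To remove OpenJDK, run the following for each package listed:
[root@Centos 桌面]# rpm -e --nodeps <package-name>
Check again: # rpm -qa | grep java (no output means everything was removed)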
----------------------------------------------------------------------------------
[root@Centos 桌面]# java -version
java version "1.7.0_09-icedtea"
OpenJDK Runtime Environment (rhel-2.3.4.1.el6_3-x86_64)
OpenJDK 64-Bit Server VM (build 23.2-b09, mixed mode)
[root@Centos 桌面]# rpm -qa | grep java
tzdata-java-2012j-1.el6.noarch
java-1.7.0-openjdk-1.7.0.9-2.3.4.1.el6_3.x86_64
java-1.6.0-openjdk-1.6.0.0-1.50.1.11.5.el6_3.x86_64
[root@Centos 桌面]# rpm -e --nodeps tzdata-java-2012j-1.el6.noarch
[root@Centos 桌面]# rpm -e --nodeps java-1.7.0-openjdk-1.7.0.9-2.3.4.1.el6_3.x86_64
[root@Centos 桌面]# rpm -e --nodeps java-1.6.0-openjdk-1.6.0.0-1.50.1.11.5.el6_3.x86_64
[root@Centos 桌面]# rpm -qa | grep java
[root@Centos 桌面]#
----------------------------------------------------------------------------------
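The removal above is done one package at a time. As a minimal sketch, the package list could be filtered first and then handed to rpm. The function name select_openjdk_packages is hypothetical, and the pattern matches only the OpenJDK packages, so tzdata-java (which the log above also removes) is deliberately left out.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read package names (one per line, as printed by
# `rpm -qa`) on stdin and print only the OpenJDK packages.
select_openjdk_packages() {
    # `|| true` keeps the exit status at 0 when nothing matches.
    grep -E '^java-[0-9.]+-openjdk' || true
}

# Sample input copied from the listing above.
sample='tzdata-java-2012j-1.el6.noarch
java-1.7.0-openjdk-1.7.0.9-2.3.4.1.el6_3.x86_64
java-1.6.0-openjdk-1.6.0.0-1.50.1.11.5.el6_3.x86_64'

printf '%s\n' "$sample" | select_openjdk_packages

# On a real CentOS host (as root) the same filter could feed rpm:
#   rpm -qa | select_openjdk_packages | xargs -r rpm -e --nodeps
```

The demonstration only prints the two matching package names; nothing is uninstalled unless the commented rpm pipeline is run by hand.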
1.2 Install the downloaded JDK and configure environment variables
1. Unpack the JDK
------------------------------------------------------------
[root@Centos 桌面]# cd /root
[root@Centos ~]# tar zxvf jdk-8u65-linux-x64.gz
------------------------------------------------------------
2. Configure environment variables
1. Edit /etc/profile: # vi /etc/profile
[root@Centos ~]# vi /etc/profile
2. Add the JDK paths to /etc/profile
export JAVA_HOME=/root/jdk1.8.0_65
export JRE_HOME=/root/jdk1.8.0_65/jre
export CLASSPATH=.:$CLASSPATH:$JAVA_HOME/lib:$JRE_HOME/lib
export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin

3. Save the file and apply the changes
[root@Centos ~]# source /etc/profile
[root@Centos ~]# echo $JAVA_HOME
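Editing /etc/profile by hand works, but running the steps twice can leave duplicate export lines. The sketch below appends the same four variables only once. The function name append_java_env is hypothetical, and the demonstration writes to a scratch file rather than the real /etc/profile.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append the JDK variables to a profile file only once.
# $1 = profile file (for example /etc/profile), $2 = JDK directory.
append_java_env() {
    local profile="$1" jdk="$2"
    # Skip the append when a JAVA_HOME export is already present.
    if grep -q '^export JAVA_HOME=' "$profile" 2>/dev/null; then
        return 0
    fi
    cat >> "$profile" <<EOF
export JAVA_HOME=$jdk
export JRE_HOME=$jdk/jre
export CLASSPATH=.:\$CLASSPATH:\$JAVA_HOME/lib:\$JRE_HOME/lib
export PATH=\$PATH:\$JAVA_HOME/bin:\$JRE_HOME/bin
EOF
}

# Demonstration against a scratch file.
demo_profile=$(mktemp)
append_java_env "$demo_profile" /root/jdk1.8.0_65
append_java_env "$demo_profile" /root/jdk1.8.0_65   # second call is a no-op
cat "$demo_profile"
```

Against the real file the call would be append_java_env /etc/profile /root/jdk1.8.0_65, followed by source /etc/profile as in the step above.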
4. Verify the installation
Run the following commands and check that the output looks normal:
[root@Centos ~]# java
[root@Centos ~]# javac
[root@Centos ~]# java -version
---------------------------------------------------------------------------------
[root@Centos ~]# java
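Usage: java [-options] class [args...]
(to execute a class)
or java [-options] -jar jarfile [args...]
(to execute a jar file)
where options include:
-d32 use a 32-bit data model if available
-d64 use a 64-bit data model if available
-server to select the "server" VM
The default VM is server.

-cp <class search path of directories and zip/jar files>
-classpath <class search path of directories and zip/jar files>
A : separated list of directories, JAR archives,
and ZIP archives to search for class files.
-D<name>=<value>
set a system property
-verbose:[class|gc|jni]
enable verbose output
-version print product version and exit
-version:<value>
Warning: this feature is deprecated and will be removed
in a future release.
require the specified version to run
-showversion print product version and continue
-jre-restrict-search | -no-jre-restrict-search
Warning: this feature is deprecated and will be removed
in a future release.
include/exclude user private JREs in the version search
-? -help print this help message
-X print help on non-standard options
-ea[:<packagename>...|:<classname>]
-enableassertions[:<packagename>...|:<classname>]
enable assertions with specified granularity
-da[:<packagename>...|:<classname>]
-disableassertions[:<packagename>...|:<classname>]
disable assertions with specified granularity
-esa | -enablesystemassertions
enable system assertions
-dsa | -disablesystemassertions
disable system assertions
-agentlib:<libname>[=<options>]
load native agent library <libname>, e.g. -agentlib:hprof
see also, -agentlib:jdwp=help and -agentlib:hprof=help
-agentpath:<pathname>[=<options>]
load native agent library by full pathname
-javaagent:<jarpath>[=<options>]
load Java programming language agent, see java.lang.instrument
-splash:<imagepath>
show splash screen with specified image
See http://www.oracle.com/technetwork/java/javase/documentation/index.html for more details.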
[root@Centos ~]# javac
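Usage: javac <options> <source files>
where possible options include:
-g Generate all debugging info
-g:none Generate no debugging info
-g:{lines,vars,source} Generate only some debugging info
-nowarn Generate no warnings
-verbose Output messages about what the compiler is doing
-deprecation Output source locations where deprecated APIs are used
-classpath <path> Specify where to find user class files and annotation processors
-cp <path> Specify where to find user class files and annotation processors
-sourcepath <path> Specify where to find input source files
-bootclasspath <path> Override location of bootstrap class files
-extdirs <dirs> Override location of installed extensions
-endorseddirs <dirs> Override location of endorsed standards path
-proc:{none,only} Control whether annotation processing and/or compilation is done.
-processor <class1>[,<class2>,<class3>...] Names of the annotation processors to run; bypasses default discovery process
-processorpath <path> Specify where to find annotation processors
-parameters Generate metadata for reflection on method parameters
-d <directory> Specify where to place generated class files
-s <directory> Specify where to place generated source files
-h <directory> Specify where to place generated native header files
-implicit:{none,class} Specify whether or not to generate class files for implicitly referenced files
-encoding <encoding> Specify character encoding used by source files
-source <release> Provide source compatibility with specified release
-target <release> Generate class files for specific VM version
-profile <profile> Check that API used is available in the specified profile
-version Version information
-help Print a synopsis of standard options
-Akey[=value] Options to pass to annotation processors
-X Print a synopsis of nonstandard options
-J<flag> Pass <flag> directly to the runtime system
-Werror Terminate compilation if warnings occur
@<filename> Read options and filenames from file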

[root@Centos ~]# java -version
java version "1.8.0_65"
Java(TM) SE Runtime Environment (build 1.8.0_65-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.65-b01, mixed mode)

·······························JDK installation complete·····································
2. Installing Hadoop
1. In Hadoop's conf directory, configure hadoop-env.sh, core-site.xml, hdfs-site.xml and mapred-site.xml
1.1 In hadoop-env.sh, set the JDK that Hadoop uses
---------------------------------------------
[root@Centos ~]# cd hadoop-1.2.1/
[root@Centos hadoop-1.2.1]# cd conf
[root@Centos conf]# vi hadoop-env.sh
---------------------------------------------
Configuration:
export JAVA_HOME=/root/jdk1.8.0_65
1.2 In core-site.xml, set the HDFS address and port
------------------------------------------------
[root@Centos conf]# vi core-site.xml
------------------------------------------------
Configuration:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
1.3 In hdfs-site.xml, set the HDFS replication factor (1 for a single node)
-------------------------------------------------
[root@Centos conf]# vi hdfs-site.xml
-------------------------------------------------
Configuration:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
1.4 In mapred-site.xml, set the MapReduce JobTracker address and port
-------------------------------------------------
[root@Centos conf]# vi mapred-site.xml
--------------------------------------------
Configuration:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
--------------------------------------------------------------------
[root@Centos conf]# vi hadoop-env.sh
[root@Centos conf]# vi core-site.xml
[root@Centos conf]# vi hdfs-site.xml
[root@Centos conf]# vi mapred-site.xml
--------------------------------------------------------------------
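The three site files above each hold a single property, so they can also be generated rather than typed in vi. The following is a minimal sketch: write_site_xml is a hypothetical helper, and the demonstration writes into a scratch directory instead of /root/hadoop-1.2.1/conf.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a one-property Hadoop site file.
# $1 = output file, $2 = property name, $3 = property value.
write_site_xml() {
    cat > "$1" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>$2</name>
    <value>$3</value>
  </property>
</configuration>
EOF
}

# Scratch directory standing in for hadoop-1.2.1/conf.
conf_dir=$(mktemp -d)
write_site_xml "$conf_dir/core-site.xml"   fs.default.name    hdfs://localhost:9000
write_site_xml "$conf_dir/hdfs-site.xml"   dfs.replication    1
write_site_xml "$conf_dir/mapred-site.xml" mapred.job.tracker localhost:9001
cat "$conf_dir/core-site.xml"
```

The property names and values are the ones used in steps 1.2 to 1.4. Pointing conf_dir at the real conf directory would overwrite the existing files, so a backup first would be prudent.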
2. Passwordless SSH login
--------------------------------------------------------------------
[root@Centos conf]# cd /root
[root@Centos ~]# ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
ed:48:64:29:62:37:c1:e9:3d:84:bf:ad:4e:50:5e:66 root@Centos
The key's randomart image is:
+--[ RSA 2048]----+
| ..o |
| +... |
| o.++= E |
| . o.B+= |
| . S+. |
| o.o. |
| o.. |
| .. |
| .. |
+-----------------+
[root@Centos ~]# cd .ssh
[root@Centos .ssh]# ls
id_rsa id_rsa.pub
[root@Centos .ssh]# cp id_rsa.pub authorized_keys
[root@Centos .ssh]# ls
authorized_keys id_rsa id_rsa.pub
[root@Centos .ssh]# ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is 3f:84:db:2f:53:a9:09:a6:61:a2:3a:82:80:6c:af:1a.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
-------------------------------------------------------------------------------
Verify passwordless login
-------------------------------------------------------------------------------
[root@Centos ~]# ssh localhost
Last login: Sun Apr 3 23:19:51 2016 from localhost
[root@Centos ~]# exit
logout
Connection to localhost closed.
[root@Centos ~]# ssh localhost
Last login: Sun Apr 3 23:20:12 2016 from localhost
[root@Centos ~]# exit
logout
Connection to localhost closed.
[root@Centos ~]#
----------------------------Passwordless SSH login configured successfully----------------------------
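The log above copies id_rsa.pub over authorized_keys, which would discard any keys already authorized. A slightly more careful variant, sketched here under the assumption that appending is acceptable, adds the key only when it is missing and tightens the permissions sshd usually expects. The function name install_pubkey and the placeholder key are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: add a public key to authorized_keys without
# overwriting existing entries.
# $1 = public key file, $2 = .ssh directory.
install_pubkey() {
    local pub_file="$1" ssh_dir="$2"
    local auth_file="$ssh_dir/authorized_keys" key
    mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir"
    touch "$auth_file" && chmod 600 "$auth_file"
    key=$(cat "$pub_file")
    # -x: whole-line match, -F: fixed string; append only when absent.
    grep -qxF -e "$key" "$auth_file" || printf '%s\n' "$key" >> "$auth_file"
}

# Demonstration in a scratch directory with a placeholder (not a real) key.
demo_dir=$(mktemp -d)
printf '%s\n' 'ssh-rsa AAAAEXAMPLEONLY root@Centos' > "$demo_dir/id_rsa.pub"
install_pubkey "$demo_dir/id_rsa.pub" "$demo_dir/.ssh"
install_pubkey "$demo_dir/id_rsa.pub" "$demo_dir/.ssh"   # no duplicate added
cat "$demo_dir/.ssh/authorized_keys"

# On the real host the key pair could be created non-interactively first:
#   ssh-keygen -t rsa -N '' -f /root/.ssh/id_rsa
#   install_pubkey /root/.ssh/id_rsa.pub /root/.ssh
```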
3. Format HDFS
Command: # bin/hadoop namenode -format
-----------------------------------------------------------------------------
[root@Centos ~]# cd /root/hadoop-1.2.1/
[root@Centos hadoop-1.2.1]# bin/hadoop namenode -format
16/04/03 23:24:12 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = java.net.UnknownHostException: Centos: Centos: unknown error
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 1.2.1
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2 -r 1503152; compiled by 'mattf' on Mon Jul 22 15:23:09 PDT 2013
STARTUP_MSG: java = 1.8.0_65
************************************************************/
16/04/03 23:24:13 INFO util.GSet: Computing capacity for map BlocksMap
16/04/03 23:24:13 INFO util.GSet: VM type = 64-bit
16/04/03 23:24:13 INFO util.GSet: 2.0% max memory = 1013645312
16/04/03 23:24:13 INFO util.GSet: capacity = 2^21 = 2097152 entries
16/04/03 23:24:13 INFO util.GSet: recommended=2097152, actual=2097152
16/04/03 23:24:15 INFO namenode.FSNamesystem: fsOwner=root
16/04/03 23:24:15 INFO namenode.FSNamesystem: supergroup=supergroup
16/04/03 23:24:15 INFO namenode.FSNamesystem: isPermissionEnabled=true
16/04/03 23:24:15 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
16/04/03 23:24:15 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
16/04/03 23:24:15 INFO namenode.FSEditLog: dfs.namenode.edits.toleration.length = 0
16/04/03 23:24:15 INFO namenode.NameNode: Caching file names occuring more than 10 times
16/04/03 23:24:17 INFO common.Storage: Image file /tmp/hadoop-root/dfs/name/current/fsimage of size 110 bytes saved in 0 seconds.
16/04/03 23:24:18 INFO namenode.FSEditLog: closing edit log: position=4, editlog=/tmp/hadoop-root/dfs/name/current/edits
16/04/03 23:24:18 INFO namenode.FSEditLog: close success: truncate to 4, editlog=/tmp/hadoop-root/dfs/name/current/edits
16/04/03 23:24:18 INFO common.Storage: Storage directory /tmp/hadoop-root/dfs/name has been successfully formatted.
16/04/03 23:24:18 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at java.net.UnknownHostException: Centos: Centos: unknown error
************************************************************/
-----------------------------------------------------------------------------
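The format output reports "Centos: unknown error": the hostname Centos cannot be resolved. Fix this by mapping the hostname in /etc/hosts, as in the next step.
--------------------------------------------------------------------------
[root@Centos hadoop-1.2.1]# vi /etc/hosts
Configuration: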
127.0.0.1 localhost Centos
-------------------------------------------------------------------------
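Whether the hostname is already mapped can be checked before formatting. The sketch below scans a hosts-style file for a given name. hosts_has_name is a hypothetical helper, and the demonstration uses a scratch file rather than the real /etc/hosts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when a hosts-style file maps the given name.
# $1 = hosts file, $2 = hostname to look for.
hosts_has_name() {
    local hosts_file="$1" name="$2"
    awk -v n="$name" '
        /^[[:space:]]*#/ { next }              # skip comment lines
        {
            for (i = 2; i <= NF; i++) {        # field 1 is the IP address
                if ($i ~ /^#/) break           # stop at a trailing comment
                if ($i == n) found = 1
            }
        }
        END { exit (found ? 0 : 1) }
    ' "$hosts_file"
}

demo_hosts=$(mktemp)
printf '127.0.0.1 localhost\n' > "$demo_hosts"
hosts_has_name "$demo_hosts" Centos || echo "Centos is not mapped yet"
printf '127.0.0.1 localhost Centos\n' > "$demo_hosts"
hosts_has_name "$demo_hosts" Centos && echo "Centos is mapped"
```

On the real machine the call would be hosts_has_name /etc/hosts "$(hostname)".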
Format the NameNode again
--------------------------------------------------------------------------
[root@Centos hadoop-1.2.1]# vi /etc/hosts
[root@Centos hadoop-1.2.1]# bin/hadoop namenode -format
16/04/03 23:26:30 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = Centos/127.0.0.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 1.2.1
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2 -r 1503152; compiled by 'mattf' on Mon Jul 22 15:23:09 PDT 2013
STARTUP_MSG: java = 1.8.0_65
************************************************************/
Re-format filesystem in /tmp/hadoop-root/dfs/name ? (Y or N) Y
16/04/03 23:26:33 INFO util.GSet: Computing capacity for map BlocksMap
16/04/03 23:26:33 INFO util.GSet: VM type = 64-bit
16/04/03 23:26:33 INFO util.GSet: 2.0% max memory = 1013645312
16/04/03 23:26:33 INFO util.GSet: capacity = 2^21 = 2097152 entries
16/04/03 23:26:33 INFO util.GSet: recommended=2097152, actual=2097152
16/04/03 23:26:33 INFO namenode.FSNamesystem: fsOwner=root
16/04/03 23:26:33 INFO namenode.FSNamesystem: supergroup=supergroup
16/04/03 23:26:33 INFO namenode.FSNamesystem: isPermissionEnabled=true
16/04/03 23:26:33 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
16/04/03 23:26:33 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
16/04/03 23:26:33 INFO namenode.FSEditLog: dfs.namenode.edits.toleration.length = 0
16/04/03 23:26:33 INFO namenode.NameNode: Caching file names occuring more than 10 times
16/04/03 23:26:34 INFO common.Storage: Image file /tmp/hadoop-root/dfs/name/current/fsimage of size 110 bytes saved in 0 seconds.
16/04/03 23:26:34 INFO namenode.FSEditLog: closing edit log: position=4, editlog=/tmp/hadoop-root/dfs/name/current/edits
16/04/03 23:26:34 INFO namenode.FSEditLog: close success: truncate to 4, editlog=/tmp/hadoop-root/dfs/name/current/edits
16/04/03 23:26:34 INFO common.Storage: Storage directory /tmp/hadoop-root/dfs/name has been successfully formatted.
16/04/03 23:26:34 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at Centos/127.0.0.1
************************************************************/
---------------------------NameNode formatted successfully------------------------------
4. Start Hadoop
Stop the firewall: # service iptables stop
Start the Hadoop cluster: # bin/start-all.sh
Stop the Hadoop cluster: # bin/stop-all.sh
---------------------------------------------------------------------------
Stop the firewall
[root@Centos hadoop-1.2.1]# service iptables stop
iptables: Flushing firewall rules: [ OK ]
iptables: Setting chains to policy ACCEPT: filter [ OK ]
iptables: Unloading modules: [ OK ]
Start the Hadoop cluster
[root@Centos hadoop-1.2.1]# bin/start-all.sh
starting namenode, logging to /root/hadoop-1.2.1/libexec/../logs/hadoop-root-namenode-Centos.out
localhost: starting datanode, logging to /root/hadoop-1.2.1/libexec/../logs/hadoop-root-datanode-Centos.out
localhost: starting secondarynamenode, logging to /root/hadoop-1.2.1/libexec/../logs/hadoop-root-secondarynamenode-Centos.out
starting jobtracker, logging to /root/hadoop-1.2.1/libexec/../logs/hadoop-root-jobtracker-Centos.out
localhost: starting tasktracker, logging to /root/hadoop-1.2.1/libexec/../logs/hadoop-root-tasktracker-Centos.out
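Verify that the cluster started correctly: startup succeeded if jps lists all five daemons (NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker)
Check the running processes again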
[root@Centos hadoop-1.2.1]# cd mahout-distribution-0.6/
[root@Centos mahout-distribution-0.6]# jps
30692 SecondaryNameNode
30437 NameNode
31382 Jps
30903 TaskTracker
30775 JobTracker
30553 DataNode
[root@Centos mahout-distribution-0.6]# jps
30692 SecondaryNameNode
31477 Jps
30437 NameNode
30903 TaskTracker
30775 JobTracker
30553 DataNode
[root@Centos mahout-distribution-0.6]# cd ..
Stop the Hadoop cluster
[root@Centos hadoop-1.2.1]# bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
[root@Centos hadoop-1.2.1]#
------------------------Pseudo-distributed Hadoop installed successfully------------------------
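Counting the daemons in the jps output by eye is easy to get wrong. As a rough sketch, the check can be scripted by reading jps output and reporting which of the five Hadoop 1.x daemons are absent. missing_daemons is a hypothetical helper, and the sample text is copied from the jps listing above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `jps` output on stdin and print the names of
# expected Hadoop 1.x daemons that are not running.
missing_daemons() {
    local jps_out daemon
    jps_out=$(cat)
    for daemon in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
        # Compare against the second column exactly, so "NameNode" is not
        # satisfied by "SecondaryNameNode".
        printf '%s\n' "$jps_out" | awk '{ print $2 }' | grep -qx "$daemon" \
            || echo "$daemon"
    done
}

sample_jps='30692 SecondaryNameNode
30437 NameNode
31382 Jps
30903 TaskTracker
30775 JobTracker
30553 DataNode'

missing=$(printf '%s\n' "$sample_jps" | missing_daemons)
if [ -z "$missing" ]; then
    echo "all five daemons are running"
else
    echo "missing: $missing"
fi
```

On a live system the input would come from jps directly: jps | missing_daemons.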
**********************************************************************
**********************************************************************
Phase 2: Mahout installation
1. Unpack Mahout
[root@Centos hadoop-1.2.1]# tar zxvf mahout-distribution-0.6.tar.gz
2. Configure environment variables
export HADOOP_HOME=/root/hadoop-1.2.1
export HADOOP_CONF_DIR=/root/hadoop-1.2.1/conf
export MAHOUT_HOME=/root/hadoop-1.2.1/mahout-distribution-0.6
export MAHOUT_CONF_DIR=/root/hadoop-1.2.1/mahout-distribution-0.6/conf
export PATH=$PATH:$MAHOUT_HOME/conf:$MAHOUT_HOME/bin
3. Test that Mahout starts
-------------------------------------------------------------------------
[root@Centos mahout-distribution-0.6]# cd ..
[root@Centos hadoop-1.2.1]# bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
[root@Centos hadoop-1.2.1]# cd ..
You have mail in /var/spool/mail/root
[root@Centos ~]# cd ruanjian/
[root@Centos ruanjian]# tar zxvf
tar: Old option `f' requires an argument.
Try `tar --help' or `tar --usage' for more information.
[root@Centos ruanjian]# cd ..
[root@Centos ~]# cd hadoop-1.2.1/
[root@Centos hadoop-1.2.1]# export HADOOP_HOME=/root/hadoop-1.2.1
[root@Centos hadoop-1.2.1]# export HADOOP_CONF_DIR=/root/hadoop-1.2.1/conf
[root@Centos hadoop-1.2.1]# export MAHOUT_HOME=/root/hadoop-1.2.1/mahout-distribution-0.6
[root@Centos hadoop-1.2.1]# export MAHOUT_CONF_DIR=/root/hadoop-1.2.1/mahout-distribution-0.6/conf
[root@Centos hadoop-1.2.1]# export PATH=$PATH:$MAHOUT_HOME/conf:$MAHOUT_HOME/bin
[root@Centos hadoop-1.2.1]# cd mahout-distribution-0.6/
[root@Centos mahout-distribution-0.6]# bin/mahout
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/root/hadoop-1.2.1
HADOOP_CONF_DIR=/root/hadoop-1.2.1/conf
MAHOUT-JOB: /root/hadoop-1.2.1/mahout-distribution-0.6/mahout-examples-0.6-job.jar
Warning: $HADOOP_HOME is deprecated.

An example program must be given as the first argument.
Valid program names are:
arff.vector: : Generate Vectors from an ARFF file or directory
baumwelch: : Baum-Welch algorithm for unsupervised HMM training
canopy: : Canopy clustering
cat: : Print a file or resource as the logistic regression models would see it
cleansvd: : Cleanup and verification of SVD output
clusterdump: : Dump cluster output to text
clusterpp: : Groups Clustering Output In Clusters
cmdump: : Dump confusion matrix in HTML or text formats
cvb: : LDA via Collapsed Variation Bayes (0th deriv. approx)
cvb0_local: : LDA via Collapsed Variation Bayes, in memory locally.
dirichlet: : Dirichlet Clustering
eigencuts: : Eigencuts spectral clustering
evaluateFactorization: : compute RMSE and MAE of a rating matrix factorization against probes
fkmeans: : Fuzzy K-means clustering
fpg: : Frequent Pattern Growth
hmmpredict: : Generate random sequence of observations by given HMM
itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering
kmeans: : K-means clustering
lda: : Latent Dirchlet Allocation
ldatopics: : LDA Print Topics
lucene.vector: : Generate Vectors from a Lucene index
matrixdump: : Dump matrix in CSV format
matrixmult: : Take the product of two matrices
meanshift: : Mean Shift clustering
minhash: : Run Minhash clustering
pagerank: : compute the PageRank of a graph
parallelALS: : ALS-WR factorization of a rating matrix
prepare20newsgroups: : Reformat 20 newsgroups data
randomwalkwithrestart: : compute all other vertices' proximity to a source vertex in a graph
recommendfactorized: : Compute recommendations using the factorization of a rating matrix
recommenditembased: : Compute recommendations using item-based collaborative filtering
regexconverter: : Convert text files on a per line basis based on regular expressions
rowid: : Map SequenceFile<Text,VectorWritable> to {SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}
rowsimilarity: : Compute the pairwise similarities of the rows of a matrix
runAdaptiveLogistic: : Score new production data using a probably trained and validated AdaptivelogisticRegression model
runlogistic: : Run a logistic regression model against CSV data
seq2encoded: : Encoded Sparse Vector generation from Text sequence files
seq2sparse: : Sparse Vector generation from Text sequence files
seqdirectory: : Generate sequence files (of Text) from a directory
seqdumper: : Generic Sequence File dumper
seqwiki: : Wikipedia xml dump to sequence file
spectralkmeans: : Spectral k-means clustering
split: : Split Input data into test and train sets
splitDataset: : split a rating dataset into training and probe parts
ssvd: : Stochastic SVD
svd: : Lanczos Singular Value Decomposition
testclassifier: : Test the text based Bayes Classifier
testnb: : Test the Vector-based Bayes classifier
trainAdaptiveLogistic: : Train an AdaptivelogisticRegression model
trainclassifier: : Train the text based Bayes Classifier
trainlogistic: : Train a logistic regression using stochastic gradient descent
trainnb: : Train the Vector-based Bayes classifier
transpose: : Take the transpose of a matrix
validateAdaptiveLogistic: : Validate an AdaptivelogisticRegression model against hold-out data set
vecdist: : Compute the distances between a set of Vectors (or Cluster or Canopy, they must fit in memory) and a list of Vectors
vectordump: : Dump vectors from a sequence file to text
viterbi: : Viterbi decoding of hidden states from given output states sequence
wikipediaDataSetCreator: : Splits data set of wikipedia wrt feature like country
wikipediaXMLSplitter: : Reads wikipedia data and creates ch
[root@Centos mahout-distribution-0.6]#
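**********An example program must be given as the first argument.****** If this message appears, Mahout is installed correctly.

--------------------------------Mahout installed successfully-----------------------------------------------------------------------

Phase 3: Bayes algorithm test on 20 Newsgroups
1. Unpack the 20 Newsgroups archive
1. Create a data directory under root's home directory (/root) and unpack the downloaded 20 Newsgroups archive there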
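Exported paths such as MAHOUT_HOME are typed by hand, and a single missing character in a directory name goes unnoticed until a later command fails. The sketch below checks that each named variable points at an existing directory. check_dir_vars is a hypothetical helper, and the demonstration uses a scratch directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report variables that do not name an existing directory.
# Arguments are variable NAMES (not values), e.g. check_dir_vars MAHOUT_HOME.
check_dir_vars() {
    local name value status=0
    for name in "$@"; do
        # Indirect lookup of the variable called $name (names are trusted input).
        eval "value=\${$name:-}"
        if [ ! -d "$value" ]; then
            echo "$name is not an existing directory: ${value:-<unset>}"
            status=1
        fi
    done
    return "$status"
}

# Demonstration: one correct path and one with the hyphen left out.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/mahout-distribution-0.6"
MAHOUT_HOME="$demo_root/mahout-distribution-0.6"
check_dir_vars MAHOUT_HOME && echo "MAHOUT_HOME looks fine"
MAHOUT_HOME="$demo_root/mahoutdistribution-0.6"
check_dir_vars MAHOUT_HOME || echo "fix the path before running bin/mahout"
```

For this setup the call would be check_dir_vars HADOOP_HOME HADOOP_CONF_DIR MAHOUT_HOME MAHOUT_CONF_DIR after the exports.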
----------------------------------------------------------------------
[root@Centos mahout-distribution-0.6]# cd ..
[root@Centos hadoop-1.2.1]# cd ..
[root@Centos ~]# mkdir data
[root@Centos ~]# ls
anaconda-ks.cfg install.log ruanjian 视频 下载
data install.log.syslog 公共的 图片 音乐
hadoop-1.2.1 jdk1.8.0_65 模板 文档 桌面
[root@Centos ~]# cd data/
[root@Centos data]# ls
20news-bydate.tar.gz
[root@Centos data]# tar zxvf
tar: Old option `f' requires an argument.
Try `tar --help' or `tar --usage' for more information.
[root@Centos data]# tar zxvf 20news-bydate.tar.gz
[root@Centos data]# ls
20news-bydate.tar.gz 20news-bydate-test 20news-bydate-train
[root@Centos data]#
-----------------------------------------------------------------------------------
2. Start Hadoop and Mahout
----------------------------------------------------------------------------------
[root@Centos data]# cd /root/hadoop-1.2.1/mahout-distribution-0.6/
[root@Centos mahout-distribution-0.6]# jps
34338 Jps
[root@Centos mahout-distribution-0.6]# cd ..
[root@Centos hadoop-1.2.1]# bin/start-all.sh
Warning: $HADOOP_HOME is deprecated.

starting namenode, logging to /root/hadoop-1.2.1/libexec/../logs/hadoop-root-namenode-Centos.out
localhost: starting datanode, logging to /root/hadoop-1.2.1/libexec/../logs/hadoop-root-datanode-Centos.out
localhost: starting secondarynamenode, logging to /root/hadoop-1.2.1/libexec/../logs/hadoop-root-secondarynamenode-Centos.out
starting jobtracker, logging to /root/hadoop-1.2.1/libexec/../logs/hadoop-root-jobtracker-Centos.out
localhost: starting tasktracker, logging to /root/hadoop-1.2.1/libexec/../logs/hadoop-root-tasktracker-Centos.out
[root@Centos hadoop-1.2.1]# cd mahout-distribution-0.6/
[root@Centos mahout-distribution-0.6]# jps
34979 Jps
34757 JobTracker
34886 TaskTracker
34663 SecondaryNameNode
34408 NameNode
34524 DataNode
[root@Centos mahout-distribution-0.6]# bin/mahout
-------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------
************************************************************************************************************************
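Bayes algorithm test: automatic text classification on 20 Newsgroups
Step 1: Build the training and test sets
The test set should be built from 20news-bydate-test. In the logged run further down, both commands were given 20news-bydate-train, which is why the two output directories list identical file sizes.
bin/mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
-p /root/data/20news-bydate-test \
-o /root/data/bayes-test-input \
-a org.apache.mahout.vectorizer.DefaultAnalyzer \
-c UTF-8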

bin/mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
-p /root/data/20news-bydate-train \
-o /root/data/bayes-train-input \
-a org.apache.mahout.vectorizer.DefaultAnalyzer \
-c UTF-8
-----------------------------------------------------------------------------------------------------
Build the training and test sets
------------------------------------------------------------------------------------------------------
[root@Centos mahout-distribution-0.6]# bin/mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
> -p /root/data/20news-bydate-train \
> -o /root/data/bayes-test-input \
> -a org.apache.mahout.vectorizer.DefaultAnalyzer \
>
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/root/hadoop-1.2.1
HADOOP_CONF_DIR=/root/hadoop-1.2.1/conf
MAHOUT-JOB: /root/hadoop-1.2.1/mahout-distribution-0.6/mahout-examples-0.6-job.jar
Warning: $HADOOP_HOME is deprecated.

16/04/04 08:59:20 WARN driver.MahoutDriver: No org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups.props found on classpath, will use command-line arguments only
Usage:
[--analyzerName <analyzerName> --charset <charset> --outputDir <outputDir>
--parent <parent> --help]
Options
--analyzerName (-a) analyzerName The class name of the analyzer
--charset (-c) charset The name of the character encoding of the
input files
--outputDir (-o) outputDir The output directory
--parent (-p) parent Parent dir containing the newsgroups
--help (-h) Print out help
16/04/04 08:59:20 INFO driver.MahoutDriver: Program took 167 ms (Minutes: 0.0027833333333333334)
[root@Centos mahout-distribution-0.6]# bin/mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
> -p /root/data/20news-bydate-train \
> -o /root/data/bayes-test-input \
> -a org.apache.mahout.vectorizer.DefaultAnalyzer \
> -c UTF-8
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/root/hadoop-1.2.1
HADOOP_CONF_DIR=/root/hadoop-1.2.1/conf
MAHOUT-JOB: /root/hadoop-1.2.1/mahout-distribution-0.6/mahout-examples-0.6-job.jar
Warning: $HADOOP_HOME is deprecated.

16/04/04 08:59:41 WARN driver.MahoutDriver: No org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups.props found on classpath, will use command-line arguments only
16/04/04 09:00:29 INFO driver.MahoutDriver: Program took 47897 ms (Minutes: 0.7982833333333333)
[root@Centos mahout-distribution-0.6]# bin/mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
> -p /root/data/20news-bydate-train \
> -o /root/data/bayes-train-input \
> -a org.apache.mahout.vectorizer.DefaultAnalyzer \
> -c UTF-8
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/root/hadoop-1.2.1
HADOOP_CONF_DIR=/root/hadoop-1.2.1/conf
MAHOUT-JOB: /root/hadoop-1.2.1/mahout-distribution-0.6/mahout-examples-0.6-job.jar
Warning: $HADOOP_HOME is deprecated.

16/04/04 09:01:07 WARN driver.MahoutDriver: No org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups.props found on classpath, will use command-line arguments only
16/04/04 09:01:27 INFO driver.MahoutDriver: Program took 19347 ms (Minutes: 0.32245)
---------------Check the output directories
[root@Centos mahout-distribution-0.6]# cd ..
[root@Centos hadoop-1.2.1]# cd ..
[root@Centos ~]# cd data
[root@Centos data]# ls
20news-bydate.tar.gz 20news-bydate-train bayes-train-input
20news-bydate-test bayes-test-input
[root@Centos data]#
-------------------bayes-test-input----bayes-train-input-----------------training and test sets built successfully---
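A test set is only useful when it differs from the training set, and the HDFS listings later in this walkthrough show the same file sizes in both directories. A quick comparison before uploading can catch that. The sketch below relies on diff -rq; same_tree is a hypothetical helper, and the directories are scratch stand-ins for bayes-train-input and bayes-test-input.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when two directory trees have identical content.
same_tree() {
    diff -rq "$1" "$2" > /dev/null 2>&1
}

# Scratch directories standing in for the real train and test inputs.
demo=$(mktemp -d)
mkdir "$demo/train" "$demo/test"
echo "alpha" > "$demo/train/a.txt"
echo "alpha" > "$demo/test/a.txt"
same_tree "$demo/train" "$demo/test" \
    && echo "WARNING: train and test inputs are identical"

echo "beta" > "$demo/test/a.txt"
same_tree "$demo/train" "$demo/test" \
    || echo "train and test inputs differ, as they should"
```

For the directories built above the call would be same_tree /root/data/bayes-train-input /root/data/bayes-test-input.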



Step 2: Upload to HDFS
Create the upload directory: bin/hadoop fs -mkdir 20news
Upload to HDFS: bin/hadoop fs -put <local-directory> 20news
List the contents: bin/hadoop fs -ls
bin/hadoop fs -ls 20news
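The upload commands can be wrapped in a small script with a dry-run switch, so the exact command lines can be reviewed before they touch HDFS. This is a sketch only: run, upload_inputs, and the HADOOP and DRY_RUN variables are hypothetical names, while the fs -mkdir, fs -put and fs -ls subcommands are the ones used in this step.

```shell
#!/usr/bin/env bash
# Path to the hadoop launcher; assumes the script runs from /root/hadoop-1.2.1.
HADOOP=${HADOOP:-bin/hadoop}

# Print the command when DRY_RUN is set, otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# $1 = HDFS target directory, remaining arguments = local directories.
upload_inputs() {
    local target="$1" dir
    shift
    run "$HADOOP" fs -mkdir "$target"
    for dir in "$@"; do
        run "$HADOOP" fs -put "$dir" "$target"
    done
    run "$HADOOP" fs -ls "$target"
}

# Dry run: only prints the four commands.
DRY_RUN=1
upload_inputs 20news /root/data/bayes-train-input /root/data/bayes-test-input
```

Unsetting DRY_RUN turns the same call into a real upload, which requires the cluster from Phase 1 to be running.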
-----------------------------------------------------------------------------------------------------------------------
[root@Centos hadoop-1.2.1]# cd /root/hadoop-1.2.1/
[root@Centos hadoop-1.2.1]# bin/hadoop fs -mkdir 20news
[root@Centos hadoop-1.2.1]# bin/hadoop fs -ls
Warning: $HADOOP_HOME is deprecated.

Found 1 items
drwxr-xr-x - root supergroup 0 2016-04-04 09:08 /user/root/20news

[root@Centos hadoop-1.2.1]# bin/hadoop fs -put ../data/bayes-train-input/ ./20news
[root@Centos hadoop-1.2.1]# bin/hadoop fs -ls 20news
Warning: $HADOOP_HOME is deprecated.
Found 1 items
drwxr-xr-x - root supergroup 0 2016-04-04 09:08 /user/root/20news/bayes-train-input
[root@Centos hadoop-1.2.1]# bin/hadoop fs -put ../data/bayes-test-input/ ./20news
[root@Centos hadoop-1.2.1]# bin/hadoop fs -ls 20news
Warning: $HADOOP_HOME is deprecated.
Found 2 items
drwxr-xr-x - root supergroup 0 2016-04-04 09:08 /user/root/20news/bayes-test-input
drwxr-xr-x - root supergroup 0 2016-04-04 09:08 /user/root/20news/bayes-train-input
[root@Centos hadoop-1.2.1]# bin/hadoop fs -ls 20news/bayes-train-input
Warning: $HADOOP_HOME is deprecated.

Found 20 items
-rw-r--r-- 1 root supergroup 773301 2016-04-04 09:08 /user/root/20news/bayes-train-input/alt.atheism.txt
-rw-r--r-- 1 root supergroup 687018 2016-04-04 09:08 /user/root/20news/bayes-train-input/comp.graphics.txt
-rw-r--r-- 1 root supergroup 1371301 2016-04-04 09:08 /user/root/20news/bayes-train-input/comp.os.ms-windows.misc.txt
-rw-r--r-- 1 root supergroup 605082 2016-04-04 09:08 /user/root/20news/bayes-train-input/comp.sys.ibm.pc.hardware.txt
-rw-r--r-- 1 root supergroup 539488 2016-04-04 09:08 /user/root/20news/bayes-train-input/comp.sys.mac.hardware.txt
-rw-r--r-- 1 root supergroup 924668 2016-04-04 09:08 /user/root/20news/bayes-train-input/comp.windows.x.txt
-rw-r--r-- 1 root supergroup 457202 2016-04-04 09:08 /user/root/20news/bayes-train-input/misc.forsale.txt
-rw-r--r-- 1 root supergroup 649942 2016-04-04 09:08 /user/root/20news/bayes-train-input/rec.autos.txt
-rw-r--r-- 1 root supergroup 610103 2016-04-04 09:08 /user/root/20news/bayes-train-input/rec.motorcycles.txt
-rw-r--r-- 1 root supergroup 648313 2016-04-04 09:08 /user/root/20news/bayes-train-input/rec.sport.baseball.txt
-rw-r--r-- 1 root supergroup 870760 2016-04-04 09:08 /user/root/20news/bayes-train-input/rec.sport.hockey.txt
-rw-r--r-- 1 root supergroup 1139592 2016-04-04 09:08 /user/root/20news/bayes-train-input/sci.crypt.txt
-rw-r--r-- 1 root supergroup 616166 2016-04-04 09:08 /user/root/20news/bayes-train-input/sci.electronics.txt
-rw-r--r-- 1 root supergroup 901841 2016-04-04 09:08 /user/root/20news/bayes-train-input/sci.med.txt
-rw-r--r-- 1 root supergroup 913047 2016-04-04 09:08 /user/root/20news/bayes-train-input/sci.space.txt
-rw-r--r-- 1 root supergroup 1004842 2016-04-04 09:08 /user/root/20news/bayes-train-input/soc.religion.christian.txt
-rw-r--r-- 1 root supergroup 973157 2016-04-04 09:08 /user/root/20news/bayes-train-input/talk.politics.guns.txt
-rw-r--r-- 1 root supergroup 1317255 2016-04-04 09:08 /user/root/20news/bayes-train-input/talk.politics.mideast.txt
-rw-r--r-- 1 root supergroup 980920 2016-04-04 09:08 /user/root/20news/bayes-train-input/talk.politics.misc.txt
-rw-r--r-- 1 root supergroup 623882 2016-04-04 09:08 /user/root/20news/bayes-train-input/talk.religion.misc.txt

[root@Centos hadoop-1.2.1]# bin/hadoop fs -ls 20news/bayes-test-input
Warning: $HADOOP_HOME is deprecated.

Found 20 items
-rw-r--r-- 1 root supergroup 773301 2016-04-04 09:08 /user/root/20news/bayes-test-input/alt.atheism.txt
-rw-r--r-- 1 root supergroup 687018 2016-04-04 09:08 /user/root/20news/bayes-test-input/comp.graphics.txt
-rw-r--r-- 1 root supergroup 1371301 2016-04-04 09:08 /user/root/20news/bayes-test-input/comp.os.ms-windows.misc.txt
-rw-r--r-- 1 root supergroup 605082 2016-04-04 09:08 /user/root/20news/bayes-test-input/comp.sys.ibm.pc.hardware.txt
-rw-r--r-- 1 root supergroup 539488 2016-04-04 09:08 /user/root/20news/bayes-test-input/comp.sys.mac.hardware.txt
-rw-r--r-- 1 root supergroup 924668 2016-04-04 09:08 /user/root/20news/bayes-test-input/comp.windows.x.txt
-rw-r--r-- 1 root supergroup 457202 2016-04-04 09:08 /user/root/20news/bayes-test-input/misc.forsale.txt
-rw-r--r-- 1 root supergroup 649942 2016-04-04 09:08 /user/root/20news/bayes-test-input/rec.autos.txt
-rw-r--r-- 1 root supergroup 610103 2016-04-04 09:08 /user/root/20news/bayes-test-input/rec.motorcycles.txt
-rw-r--r-- 1 root supergroup 648313 2016-04-04 09:08 /user/root/20news/bayes-test-input/rec.sport.baseball.txt
-rw-r--r-- 1 root supergroup 870760 2016-04-04 09:08 /user/root/20news/bayes-test-input/rec.sport.hockey.txt
-rw-r--r-- 1 root supergroup 1139592 2016-04-04 09:08 /user/root/20news/bayes-test-input/sci.crypt.txt
-rw-r--r-- 1 root supergroup 616166 2016-04-04 09:08 /user/root/20news/bayes-test-input/sci.electronics.txt
-rw-r--r-- 1 root supergroup 901841 2016-04-04 09:08 /user/root/20news/bayes-test-input/sci.med.txt
-rw-r--r-- 1 root supergroup 913047 2016-04-04 09:08 /user/root/20news/bayes-test-input/sci.space.txt
-rw-r--r-- 1 root supergroup 1004842 2016-04-04 09:08 /user/root/20news/bayes-test-input/soc.religion.christian.txt
-rw-r--r-- 1 root supergroup 973157 2016-04-04 09:08 /user/root/20news/bayes-test-input/talk.politics.guns.txt
-rw-r--r-- 1 root supergroup 1317255 2016-04-04 09:08 /user/root/20news/bayes-test-input/talk.politics.mideast.txt
-rw-r--r-- 1 root supergroup 980920 2016-04-04 09:08 /user/root/20news/bayes-test-input/talk.politics.misc.txt
-rw-r--r-- 1 root supergroup 623882 2016-04-04 09:08 /user/root/20news/bayes-test-input/talk.religion.misc.txt

[root@Centos hadoop-1.2.1]# bin/hadoop fs -cat 20news/bayes-train-input/talk.politics.misc.txt


rce most part uninformed ignorant public democracy i don't think so society's sense justice judged basis treatment people who make up society all those people yes includes gays lesbians bisexuals whose crimes have victims who varied diverse society wich part frank jordan d d d c c c gay arab bassoonists unite
talk.politics.misc from steveh thor.isc br.com steve hendricks subject re limiting govt re employment re why concentrate summary promoting competition does depend upon libertarians organization free barbers inc lines 60 nntp posting host thor.isc br.com article c5kh8g 961 cbnewse.cb.att.com doctor1 cbnewse.cb.att.com patrick.b.hailey writes article 1993apr15.170731.8797 isc br.isc br.com steveh thor.isc br.com steve hendricks writes two paragraphs from two different posts splicing them together my intention change steve's meaning misrepresent him any way i don't think i've done so noted another thread limiting govt problem libertarians face insuring limited government seek does become tool private interests pursue own agenda failure libertarianism ideology does provide any reasonable way restrain actions other than utopian dreams just marxism fails specify how pure communism achieved state wither away libertarians frequently fail show how weakening power state result improvement human condition patrick's example anti competitive regulations auto dealers deleted here's what i see libertarianism offering you does seem me utopian dream basic human decency common sense real grass roots example freedom liberty yes having few people acting our masters approving rejecting each our basic transactions each other does strike me wonderful way improve human condition thanks awfully patrick let me try drag discussion back original issues i've noted before i'm necessarily disputing benefits eliminating anti competitive legislation regard auto dealers barbers etc one need however swallow entire libertarian agenda accomplish end just because one grants benefits allowing anyone who wishes cut hair sell his her services without regulation does mean same unregulated barbers should free bleed people medical service without government intervention some many libertarians would argue case case basis cost benefit ratio government regulation obviously worthwhile libertarian agenda however 
does call assessment assumes costs regulation any kind always outweigh its benefits approach avoids all sorts difficult analysis strikes many rest us dogmatic say least i have objection analysis medical care education national defense local police suggests free market can provide more effective efficient means accomplishing social obj

 

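Step 3: Train the Bayes classifier
1. Model training: the training text set has already been uploaded to HDFS, so the Bayes classifier model can now be trained from it.
Command options:
  -i       input path of the training set (an HDFS path)
  -o       output path for the classification model
  -type    classifier type; cbayes is used here, bayes is the alternative
  -ng      n-gram size, default 1
  -source  location of the data source, HDFS or HBase
The -type, -ng and -source options have the same meaning in the test step later.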

bin/mahout trainclassifier \
-i /user/root/20news/bayes-train-input \
-o /user/root/20news/newsmodel \
-type cbayes \
-ng 2 \
-source hdfs
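
To make the -ng option concrete, here is a minimal Python sketch of what an n-gram size of 2 means for whitespace-separated text such as the preprocessed files shown earlier. The function name `ngrams` and the plain whitespace tokenization are assumptions made for illustration; this is not Mahout's own code, and the source does not show how Mahout tokenizes internally.

```python
def ngrams(text, n=1):
    """Return the list of word n-grams of size n from whitespace-split text."""
    tokens = text.split()
    # Slide a window of n tokens across the token sequence.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sample = "free market can provide"
print(ngrams(sample, 1))  # single words
print(ngrams(sample, 2))  # adjacent word pairs
```

With n=2 each feature is a pair of adjacent words, so the feature space grows considerably compared with n=1, which is consistent with the long training time in the transcript that follows.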

 

---------------------------------------------------------------------------------------------------------------
[root@Centos hadoop-1.2.1]# cd mahout-distribution-0.6/
[root@Centos mahout-distribution-0.6]# bin/mahout
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/root/hadoop-1.2.1
HADOOP_CONF_DIR=/root/hadoop-1.2.1/conf
MAHOUT-JOB: /root/hadoop-1.2.1/mahout-distribution-0.6/mahout-examples-0.6-job.jar
Warning: $HADOOP_HOME is deprecated.
[root@Centos mahout-distribution-0.6]# bin/mahout trainclassifier \
> -i /user/root/20news/bayes-train-input \
> -o /user/root/20news/newsmodel \
> -type cbayes \
> -ng 2 \
> -source hdfs
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/root/hadoop-1.2.1
HADOOP_CONF_DIR=/root/hadoop-1.2.1/conf
MAHOUT-JOB: /root/hadoop-1.2.1/mahout-distribution-0.6/mahout-examples-0.6-job.jar
Warning: $HADOOP_HOME is deprecated.

16/04/04 09:21:58 WARN driver.MahoutDriver: No trainclassifier.props found on classpath, will use command-line arguments only
16/04/04 09:21:58 INFO bayes.TrainClassifier: Training Complementary Bayes Classifier
16/04/04 09:21:59 INFO cbayes.CBayesDriver: Reading features...
16/04/04 09:22:01 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
16/04/04 09:22:02 INFO util.NativeCodeLoader: Loaded the native-hadoop library
16/04/04 09:22:02 WARN snappy.LoadSnappy: Snappy native library not loaded
16/04/04 09:22:02 INFO mapred.FileInputFormat: Total input paths to process : 20
16/04/04 09:22:04 INFO mapred.JobClient: Running job: job_201604040854_0001
16/04/04 09:22:05 INFO mapred.JobClient: map 0% reduce 0%
16/04/04 09:22:48 INFO mapred.JobClient: map 1% reduce 0%
16/04/04 09:22:49 INFO mapred.JobClient: map 2% reduce 0%
16/04/04 09:23:11 INFO mapred.JobClient: map 3% reduce 0%
16/04/04 09:23:12 INFO mapred.JobClient: map 4% reduce 0%
····································
··········································
···················································
16/04/04 10:04:11 INFO mapred.JobClient: Job complete: job_201604040854_0004
16/04/04 10:04:11 INFO mapred.JobClient: Counters: 30
16/04/04 10:04:11 INFO mapred.JobClient: Map-Reduce Framework
16/04/04 10:04:11 INFO mapred.JobClient: Spilled Records=4309
16/04/04 10:04:12 INFO mapred.JobClient: Map output materialized bytes=1473
16/04/04 10:04:12 INFO mapred.JobClient: Reduce input records=41
16/04/04 10:04:12 INFO mapred.JobClient: Virtual memory (bytes) snapshot=7733702656
16/04/04 10:04:12 INFO mapred.JobClient: Map input records=3146637
16/04/04 10:04:12 INFO mapred.JobClient: SPLIT_RAW_BYTES=416
16/04/04 10:04:12 INFO mapred.JobClient: Map output bytes=965613985
16/04/04 10:04:12 INFO mapred.JobClient: Reduce shuffle bytes=1473
16/04/04 10:04:12 INFO mapred.JobClient: Physical memory (bytes) snapshot=682602496
16/04/04 10:04:12 INFO mapred.JobClient: Map input bytes=150138778
16/04/04 10:04:12 INFO mapred.JobClient: Reduce input groups=20
16/04/04 10:04:12 INFO mapred.JobClient: Combine output records=2128
16/04/04 10:04:12 INFO mapred.JobClient: Reduce output records=20
16/04/04 10:04:12 INFO mapred.JobClient: Map output records=28673441
16/04/04 10:04:12 INFO mapred.JobClient: Combine input records=28675528
16/04/04 10:04:12 INFO mapred.JobClient: CPU time spent (ms)=210830
16/04/04 10:04:12 INFO mapred.JobClient: Total committed heap usage (bytes)=498544640
16/04/04 10:04:12 INFO mapred.JobClient: File Input Format Counters
16/04/04 10:04:12 INFO mapred.JobClient: Bytes Read=150140285
16/04/04 10:04:12 INFO mapred.JobClient: FileSystemCounters
16/04/04 10:04:12 INFO mapred.JobClient: HDFS_BYTES_READ=150140770
16/04/04 10:04:12 INFO mapred.JobClient: FILE_BYTES_WRITTEN=383730
16/04/04 10:04:12 INFO mapred.JobClient: FILE_BYTES_READ=152894
16/04/04 10:04:12 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=932
16/04/04 10:04:12 INFO mapred.JobClient: File Output Format Counters
16/04/04 10:04:12 INFO mapred.JobClient: Bytes Written=932
16/04/04 10:04:12 INFO mapred.JobClient: Job Counters
16/04/04 10:04:12 INFO mapred.JobClient: Launched map tasks=3
16/04/04 10:04:12 INFO mapred.JobClient: Launched reduce tasks=1
16/04/04 10:04:12 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=214633
16/04/04 10:04:12 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
16/04/04 10:04:12 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=320403
16/04/04 10:04:12 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
16/04/04 10:04:12 INFO mapred.JobClient: Data-local map tasks=3
16/04/04 10:04:14 INFO common.HadoopUtil: Deleting /user/root/20news/newsmodel/trainer-docCount
16/04/04 10:04:15 INFO common.HadoopUtil: Deleting /user/root/20news/newsmodel/trainer-termDocCount
16/04/04 10:04:15 INFO common.HadoopUtil: Deleting /user/root/20news/newsmodel/trainer-featureCount
16/04/04 10:04:15 INFO common.HadoopUtil: Deleting /user/root/20news/newsmodel/trainer-wordFreq
16/04/04 10:04:15 INFO common.HadoopUtil: Deleting /user/root/20news/newsmodel/trainer-tfIdf/trainer-vocabCount
16/04/04 10:04:16 INFO driver.MahoutDriver: Program took 2537700 ms (Minutes: 42.29723333333333)
[root@Centos mahout-distribution-0.6]#
------------------------------------------------------------------------------------------------------------------------
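
If the training output is saved to a file, the job counters can be pulled out for a quick comparison between runs. The following is a minimal sketch, assuming the log lines have exactly the JobClient format shown in the transcript above; `parse_counters` and `COUNTER_RE` are hypothetical names, not part of Hadoop or Mahout.

```python
import re

# Matches JobClient counter lines such as
# "16/04/04 10:04:12 INFO mapred.JobClient: Map input records=3146637"
COUNTER_RE = re.compile(r"INFO mapred\.JobClient:\s+(.+?)=(\d+)\s*$")


def parse_counters(log_text):
    """Collect 'name=value' counters from JobClient log lines into a dict."""
    counters = {}
    for line in log_text.splitlines():
        match = COUNTER_RE.search(line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


# A few lines copied from the training transcript.
log = """\
16/04/04 10:04:11 INFO mapred.JobClient: Counters: 30
16/04/04 10:04:12 INFO mapred.JobClient: Map input records=3146637
16/04/04 10:04:12 INFO mapred.JobClient: Map output records=28673441
16/04/04 10:04:12 INFO mapred.JobClient: Launched map tasks=3
"""
counters = parse_counters(log)
# Average number of map output records per input record.
print(counters["Map output records"] / counters["Map input records"])
```

Lines without a `name=number` pair, such as the progress lines and `Counters: 30`, are skipped.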
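Step 4: Test the Bayes classifier
The training run in step 3 already wrote the model to /user/root/20news/newsmodel, so there is no separate model-generation step. Here -m is the model path and -d is the test data path; -type, -ng and -source should match the values used for training. Command:
bin/mahout testclassifier \
-m /user/root/20news/newsmodel \
-d /user/root/20news/bayes-test-input \
-type cbayes \
-ng 2 \
-source hdfs \
-method mapreduce
---------------------------------------------------------------------------------
Transcript of the actual run. Note that -m in this run points to /user/root/20news/newtestsmodel, not the /user/root/20news/newsmodel directory written in step 3, and the confusion matrix it produced contains only zeros. The source does not show how newtestsmodel was created, so check the model path first if you see the same result.
---------------------------------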
[root@Centos mahout-distribution-0.6]# bin/mahout testclassifier \
> -m /user/root/20news/newtestsmodel \
> -d /user/root/20news/bayes-test-input \
> -type cbayes \
> -ng 2 \
> -source hdfs \
> -method mapreduce
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/root/hadoop-1.2.1
HADOOP_CONF_DIR=/root/hadoop-1.2.1/conf
MAHOUT-JOB: /root/hadoop-1.2.1/mahout-distribution-0.6/mahout-examples-0.6-job.jar
Warning: $HADOOP_HOME is deprecated.

16/04/04 14:10:54 WARN driver.MahoutDriver: No testclassifier.props found on classpath, will use command-line arguments only
16/04/04 14:10:56 INFO common.HadoopUtil: Deleting /user/root/20news/bayes-test-input-output
16/04/04 14:10:56 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
16/04/04 14:11:00 INFO util.NativeCodeLoader: Loaded the native-hadoop library
16/04/04 14:11:00 WARN snappy.LoadSnappy: Snappy native library not loaded
16/04/04 14:11:00 INFO mapred.FileInputFormat: Total input paths to process : 20
16/04/04 14:11:02 INFO mapred.JobClient: Running job: job_201604040854_0011
16/04/04 14:11:03 INFO mapred.JobClient: map 0% reduce 0%
16/04/04 14:11:47 INFO mapred.JobClient: map 5% reduce 0%
16/04/04 14:11:52 INFO mapred.JobClient: map 10% reduce 0%
16/04/04 14:12:33 INFO mapred.JobClient: map 19% reduce 0%
16/04/04 14:12:45 INFO mapred.JobClient: map 19% reduce 6%
16/04/04 14:12:58 INFO mapred.JobClient: map 29% reduce 6%
16/04/04 14:13:09 INFO mapred.JobClient: map 29% reduce 10%
16/04/04 14:13:36 INFO mapred.JobClient: map 39% reduce 10%
16/04/04 14:13:45 INFO mapred.JobClient: map 39% reduce 13%
16/04/04 14:13:53 INFO mapred.JobClient: map 44% reduce 13%
16/04/04 14:13:54 INFO mapred.JobClient: map 50% reduce 13%
16/04/04 14:14:01 INFO mapred.JobClient: map 50% reduce 16%
16/04/04 14:14:03 INFO mapred.JobClient: map 55% reduce 16%
16/04/04 14:14:04 INFO mapred.JobClient: map 60% reduce 16%
16/04/04 14:14:11 INFO mapred.JobClient: map 60% reduce 20%
16/04/04 14:14:22 INFO mapred.JobClient: map 70% reduce 20%
16/04/04 14:14:31 INFO mapred.JobClient: map 70% reduce 23%
16/04/04 14:14:34 INFO mapred.JobClient: map 80% reduce 23%
16/04/04 14:14:41 INFO mapred.JobClient: map 80% reduce 26%
16/04/04 14:14:43 INFO mapred.JobClient: map 85% reduce 26%
16/04/04 14:14:44 INFO mapred.JobClient: map 90% reduce 26%
16/04/04 14:14:47 INFO mapred.JobClient: map 90% reduce 30%
16/04/04 14:14:52 INFO mapred.JobClient: map 95% reduce 30%
16/04/04 14:14:53 INFO mapred.JobClient: map 100% reduce 30%
16/04/04 14:15:02 INFO mapred.JobClient: map 100% reduce 66%
16/04/04 14:15:11 INFO mapred.JobClient: map 100% reduce 100%
16/04/04 14:15:16 INFO mapred.JobClient: Job complete: job_201604040854_0011
16/04/04 14:15:28 INFO mapred.JobClient: Counters: 30
16/04/04 14:15:28 INFO mapred.JobClient: Map-Reduce Framework
16/04/04 14:15:28 INFO mapred.JobClient: Spilled Records=40
16/04/04 14:15:28 INFO mapred.JobClient: Map output materialized bytes=993
16/04/04 14:15:28 INFO mapred.JobClient: Reduce input records=20
16/04/04 14:15:28 INFO mapred.JobClient: Virtual memory (bytes) snapshot=40516427776
16/04/04 14:15:28 INFO mapred.JobClient: Map input records=11314
16/04/04 14:15:28 INFO mapred.JobClient: SPLIT_RAW_BYTES=2573
16/04/04 14:15:28 INFO mapred.JobClient: Map output bytes=470632
16/04/04 14:15:28 INFO mapred.JobClient: Reduce shuffle bytes=993
16/04/04 14:15:28 INFO mapred.JobClient: Physical memory (bytes) snapshot=4085964800
16/04/04 14:15:28 INFO mapred.JobClient: Map input bytes=16607880
16/04/04 14:15:28 INFO mapred.JobClient: Reduce input groups=20
16/04/04 14:15:28 INFO mapred.JobClient: Combine output records=20
16/04/04 14:15:28 INFO mapred.JobClient: Reduce output records=20
16/04/04 14:15:28 INFO mapred.JobClient: Map output records=11314
16/04/04 14:15:28 INFO mapred.JobClient: Combine input records=11314
16/04/04 14:15:28 INFO mapred.JobClient: CPU time spent (ms)=34980
16/04/04 14:15:28 INFO mapred.JobClient: Total committed heap usage (bytes)=3097051136
16/04/04 14:15:28 INFO mapred.JobClient: File Input Format Counters
16/04/04 14:15:28 INFO mapred.JobClient: Bytes Read=16607880
16/04/04 14:15:28 INFO mapred.JobClient: FileSystemCounters
16/04/04 14:15:28 INFO mapred.JobClient: HDFS_BYTES_READ=16610453
16/04/04 14:15:28 INFO mapred.JobClient: FILE_BYTES_WRITTEN=1166412
16/04/04 14:15:28 INFO mapred.JobClient: FILE_BYTES_READ=879
16/04/04 14:15:28 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1092
16/04/04 14:15:28 INFO mapred.JobClient: File Output Format Counters
16/04/04 14:15:28 INFO mapred.JobClient: Bytes Written=1092
16/04/04 14:15:28 INFO mapred.JobClient: Job Counters
16/04/04 14:15:28 INFO mapred.JobClient: Launched map tasks=20
16/04/04 14:15:28 INFO mapred.JobClient: Launched reduce tasks=1
16/04/04 14:15:28 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=195607
16/04/04 14:15:28 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
16/04/04 14:15:28 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=406966
16/04/04 14:15:28 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
16/04/04 14:15:28 INFO mapred.JobClient: Data-local map tasks=20
16/04/04 14:15:38 INFO bayes.BayesClassifierDriver: =======================================================
Confusion Matrix
-------------------------------------------------------
a b c d e f g h i j k l m n o p q r s t <--Classified as
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 a = soc.religion.christian
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 b = rec.autos
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 c = talk.religion.misc
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 d = comp.windows.x
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 e = rec.sport.baseball
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 f = comp.graphics
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 g = talk.politics.mideast
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 h = comp.sys.ibm.pc.hardware
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 i = sci.med
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 j = comp.os.ms-windows.misc
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 k = sci.crypt
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 l = comp.sys.mac.hardware
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 m = misc.forsale
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 n = rec.motorcycles
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 o = talk.politics.misc
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 p = sci.electronics
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 q = rec.sport.hockey
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 r = sci.space
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 s = alt.atheism
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 t = talk.politics.guns


16/04/04 14:15:38 INFO driver.MahoutDriver: Program took 283133 ms (Minutes: 4.718883333333333)


-------------------------------
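
A confusion matrix like the one above is usually summarized as an accuracy figure. The sketch below shows the calculation in Python, assuming rows are the actual classes and columns the predicted classes, as the "<--Classified as" header suggests. The function name `accuracy` and the 3-class example values are invented for illustration and are not results from this run.

```python
def accuracy(matrix):
    """Return correct/total for a square confusion matrix, or None if it is empty.

    Rows are actual classes and columns are predicted classes.
    """
    total = sum(sum(row) for row in matrix)
    if total == 0:
        # Nothing was classified at all, so accuracy is undefined.
        return None
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


# Hypothetical 3-class result, invented only to illustrate the calculation.
toy = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
]
print(accuracy(toy))

# The matrix printed in the transcript: 20 classes, every cell zero.
all_zero = [[0] * 20 for _ in range(20)]
print(accuracy(all_zero))
```

For the all-zero matrix the function returns None rather than 0, because no document was counted in any cell. That points to a problem with the run itself (for example the model or test data path) rather than to a poorly performing classifier.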

Reposted from: https://www.cnblogs.com/learningforever/p/5350460.html
