An Analysis of Parallel Bayes Classification
1 The Bayes Trainer
Package: org.apache.mahout.classifier.bayes
Implementation
The implementation is divided into three parts:
- The Trainer -- responsible for counting the words and the labels
- The Model -- responsible for holding the training data in a useful way
- The Classifier -- responsible for using the trainer's output to determine the category of previously unseen documents
1 Trainer
The trainer is manifested in several classes:
- A driver that creates the Hadoop Bayes job and outputs the model; this class wraps four map/reduce jobs.
The trainer's input is in KeyValueTextInputFormat: the first token of each line is the class label, and the remaining tokens are the features (words), as in the following example:
hockey puck stick goalie forward defenseman referee ice checking slapshot helmet
football field football pigskin referee helmet turf tackle
Here hockey and football are the class labels; the rest are features.
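The way such a line is split into a label and its features can be sketched as follows (a minimal, self-contained illustration; the TrainingLine class here is hypothetical, not Mahout's actual mapper code):

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: splits one training line into its class label
// (the first whitespace-separated token) and its features (the rest).
public class TrainingLine {
    public static String label(String line) {
        return line.trim().split("\\s+")[0];
    }

    public static List<String> features(String line) {
        String[] tokens = line.trim().split("\\s+");
        return Arrays.asList(tokens).subList(1, tokens.length);
    }
}
```

For the hockey example above, label() yields "hockey" and features() yields the remaining words.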
2 Model
The Model is the data structure used to represent the results of the training for use by the BayesClassifier. A Model can be created by hand, or, if using the BayesDriver, it can be created from the SequenceFile that is output. To create it from the SequenceFile, use the SequenceFileModelReader located in the io subpackage.
3 Classifier
The BayesClassifier is responsible for using a BayesModel to classify documents into categories.
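Conceptually, naive Bayes classification scores each label by summing the per-feature weights (log-probabilities) of the document's features and picking the best-scoring label. A minimal sketch (the SimpleBayesScorer class is hypothetical, not Mahout's BayesClassifier; unseen features score 0 here, with no smoothing):

```java
import java.util.List;
import java.util.Map;

// Schematic naive Bayes scoring, not Mahout's BayesClassifier:
// score(label) = sum of that label's weights for the document's features;
// the label with the highest score wins.
public class SimpleBayesScorer {
    // weights: label -> (feature -> log-weight)
    public static String classify(Map<String, Map<String, Double>> weights,
                                  List<String> features) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Double>> e : weights.entrySet()) {
            double score = 0.0;
            for (String f : features) {
                // Unseen features contribute 0 in this sketch (no smoothing).
                score += e.getValue().getOrDefault(f, 0.0);
            }
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }
}
```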
4 A Look at the TrainClassifier Class
Package: org.apache.mahout.classifier.bayes
This class trains the Bayes classifier. Input format: one document per line, where the first token is the class label and the remaining tokens are the features (words).
Depending on the command-line arguments, this class invokes one of two trainers:
- trainCNaiveBayes calls the CBayesDriver class;
- trainNaiveBayes calls the BayesDriver class.
The CBayesDriver and BayesDriver classes are analyzed below.
BayesDriver
Package: org.apache.mahout.classifier.bayes.mapreduce.bayes
public class BayesDriver extends Object implements BayesJob
This class implements the BayesJob interface. Its runJob method invokes four map/reduce job classes in turn:
First: BayesFeatureDriver -- reads the features in each document, normalized by the length of each document
Second: BayesTfIdfDriver -- calculates the TfIdf for each word in each label
Third: BayesWeightSummerDriver -- calculates the sums of weights for each label and for each feature
Fourth: BayesThetaNormalizerDriver -- calculates the normalization factor Sigma_W_ij for each complement class
Each of these four classes is analyzed below:
The first map/reduce class: BayesFeatureDriver
Package: org.apache.mahout.classifier.bayes.mapreduce.common
Output key type: StringTuple.class
Output value type: DoubleWritable.class
Input format: KeyValueTextInputFormat.class
Output format: BayesFeatureOutputFormat.class
Map: BayesFeatureMapper.class
Reduce: BayesFeatureReducer.class
Note: BayesFeatureDriver can be run on its own; the default input and output are:
input = new Path("/home/drew/mahout/bayes/20news-input");
output = new Path("/home/drew/mahout/bayes/20-news-features");
p = new BayesParameters(1); // gramSize defaults to 1
The output consists of three files:
$OUTPUT/trainer-termDocCount
$OUTPUT/trainer-wordFreq
$OUTPUT/trainer-featureCount
The second map/reduce class, BayesTfIdfDriver, computes TF-IDF values from the three output files of the first job. When it finishes, it deletes those three intermediate files and produces the file trainer-tfIdf, which stores the tf-idf value of each text feature.
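The quantity this job computes per (label, feature) pair can be sketched with the standard tf-idf formula, tf * log(N / df); Mahout's actual weighting may differ in detail:

```java
// Schematic tf-idf, not Mahout's exact weighting: the raw term frequency
// scaled by the inverse document frequency of the term.
public class TfIdf {
    // tf: frequency of the feature within a label's documents
    // df: number of documents containing the feature
    // n:  total number of documents
    public static double tfIdf(double tf, double df, double n) {
        return tf * Math.log(n / df);
    }
}
```

A feature appearing in every document gets log(n/n) = 0, so ubiquitous words are weighted down.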
Third: BayesWeightSummerDriver
Output key: StringTuple.class
Output value: DoubleWritable.class
Input path: the trainer-tfIdf file produced by the second map/reduce job
Output: the trainer-weights file
Input format: SequenceFileInputFormat.class
Output format: BayesWeightSummerOutputFormat.class
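What a weight-summing pass produces can be sketched as follows (illustrative names only; Mahout computes these sums inside a map/reduce job over the tf-idf weights):

```java
import java.util.HashMap;
import java.util.Map;

// Schematic of weight summing: given a stream of (label, feature, weight)
// triples, accumulate the sum per label, the sum per feature, and the
// grand total. Class and field names here are illustrative.
public class WeightSums {
    public final Map<String, Double> perLabel = new HashMap<>();
    public final Map<String, Double> perFeature = new HashMap<>();
    public double total = 0.0;

    public void add(String label, String feature, double w) {
        perLabel.merge(label, w, Double::sum);
        perFeature.merge(feature, w, Double::sum);
        total += w;
    }
}
```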
Fourth job: BayesThetaNormalizerDriver
Output key: StringTuple.class
Output value: DoubleWritable.class
Input path:
FileInputFormat.addInputPath(conf, new Path(output, "trainer-tfIdf/trainer-tfIdf"));
that is, it consumes the trainer-tfIdf file output by the second job.
Output path:
Path outPath = new Path(output, "trainer-thetaNormalizer");
which produces the file trainer-thetaNormalizer.
Output format: SequenceFileOutputFormat.class
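The idea behind a complement-class normalization factor, summing weight over every label other than the one in question, can be sketched as follows (an illustrative simplification; the actual driver works on log-weighted terms inside a map/reduce job):

```java
import java.util.Map;

// Schematic complement sum: for a given label, total the weight of all
// *other* labels (its complement). Illustrative only; not the actual
// computation inside BayesThetaNormalizerDriver.
public class ComplementSum {
    public static double complementWeight(Map<String, Double> perLabelWeight,
                                          String label) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : perLabelWeight.entrySet()) {
            if (!e.getKey().equals(label)) {
                sum += e.getValue();
            }
        }
        return sum;
    }
}
```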
Once these four jobs have run, the whole Bayes model has been built. In total, three directories of files are produced and kept:
trainer-tfIdf
trainer-thetaNormalizer
trainer-weights
With the model built, the next step is to test how well the classifier performs.
The calling class: TestClassifier
Package: org.apache.mahout.classifier.bayes
Depending on the command-line arguments, it executes either sequentially or in parallel via map/reduce. Parallel execution calls the BayesClassifierDriver class.
The BayesClassifierDriver class
Package: org.apache.mahout.classifier.bayes.mapreduce.bayes
Input format: KeyValueTextInputFormat.class
Output format: SequenceFileOutputFormat.class
When it finishes, it invokes the ConfusionMatrix class to display the results.
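A confusion matrix of this kind can be sketched as follows (a minimal stand-in, not Mahout's ConfusionMatrix class): it counts (actual, predicted) label pairs and derives an overall accuracy from the diagonal.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal confusion-matrix sketch, not Mahout's ConfusionMatrix:
// tallies (actual, predicted) pairs and reports overall accuracy.
public class SimpleConfusionMatrix {
    private final Map<String, Integer> counts = new HashMap<>();
    private int total = 0;
    private int correct = 0;

    public void add(String actual, String predicted) {
        counts.merge(actual + "->" + predicted, 1, Integer::sum);
        total++;
        if (actual.equals(predicted)) {
            correct++;
        }
    }

    public int count(String actual, String predicted) {
        return counts.getOrDefault(actual + "->" + predicted, 0);
    }

    public double accuracy() {
        return total == 0 ? 0.0 : (double) correct / total;
    }
}
```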