Source code analysis of Mahout's parallel Bayes classifier, part 2

Latest recommended article published on 2021-02-19 13:11:06

nuoline

Category: mahout. Tags: Hadoop, Mapreduce, Apache, data structures

Article link: https://blog.csdn.net/nuoline/article/details/83770241

2 Model
BayesModel is the data structure used to represent the results of training, for use by the BayesClassifier. A Model can be created by hand, or, if using the BayesDriver, it can be created from the SequenceFile that is output. To create it from the SequenceFile, use the SequenceFileModelReader located in the io subpackage.
3 Classifier
The BayesClassifier is responsible for using a BayesModel to classify documents into categories.
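4 The TrainClassifier class
Package: org.apache.mahout.classifier.bayes
This class trains the Bayes classifier. Input format: each line is one document; the first token on the line is the class label and the rest are the features (words).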
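To make the input format concrete, here is a minimal sketch of splitting one training line into a label and its features. It assumes the label is separated from the words by a tab, which matches the default separator of the KeyValueTextInputFormat used later in this article; LabeledLine is a hypothetical helper, not a Mahout class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical holder for one training line: a label plus its feature tokens. */
final class LabeledLine {
    final String label;
    final List<String> features;

    LabeledLine(String label, List<String> features) {
        this.label = label;
        this.features = features;
    }

    /** Splits "label<TAB>word word ..." into the label and its words (tab separator is an assumption). */
    static LabeledLine parse(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("expected label<TAB>features: " + line);
        }
        String label = line.substring(0, tab).trim();
        String rest = line.substring(tab + 1).trim();
        // An empty remainder means the document has no features.
        List<String> features = rest.isEmpty()
                ? new ArrayList<String>()
                : Arrays.asList(rest.split("\\s+"));
        return new LabeledLine(label, features);
    }

    public static void main(String[] args) {
        LabeledLine doc = parse("rec.autos\tengine oil change");
        System.out.println(doc.label + " -> " + doc.features);
    }
}
```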
Depending on the command-line arguments, this class calls one of two trainers:

static void    trainCNaiveBayes(org.apache.hadoop.fs.Path dir, org.apache.hadoop.fs.Path outputDir, BayesParameters params)

static void    trainNaiveBayes(org.apache.hadoop.fs.Path dir, org.apache.hadoop.fs.Path outputDir, BayesParameters params)
trainCNaiveBayes calls the CBayesDriver class; trainNaiveBayes calls the BayesDriver class.
The CBayesDriver and BayesDriver classes are analyzed in turn below.
BayesDriver package: org.apache.mahout.classifier.bayes.mapreduce.bayes
public class BayesDriver extends Object implements BayesJob
It implements the BayesJob interface.
Its runJob function calls four map/reduce job classes:
First: BayesFeatureDriver reads the features in each document, normalized by the length of each document
Second: BayesTfIdfDriver calculates the TfIdf for each word in each label
Third: BayesWeightSummerDriver calculates the sums of weights for each label and for each feature
Fourth: BayesThetaNormalizerDriver calculates the normalization factor Sigma_W_ij for each complement class
These four classes are analyzed in turn below:
The first map/reduce class: BayesFeatureDriver
Package: org.apache.mahout.classifier.bayes.mapreduce.common
Output key type: StringTuple.class
Output value type: DoubleWritable.class
Input format: KeyValueTextInputFormat.class
Output format: BayesFeatureOutputFormat.class
MAP: BayesFeatureMapper.class
REDUCE: BayesFeatureReducer.class
Note: BayesFeatureDriver can be run standalone; its default input and output are:
input = new Path("/home/drew/mahout/bayes/20news-input");
output = new Path("/home/drew/mahout/bayes/20-news-features");
p = new BayesParameters(1) (gramSize defaults to 1)
The output consists of three files:
$OUTPUT/trainer-termDocCount
$OUTPUT/trainer-wordFreq
$OUTPUT/trainer-featureCount
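The article only says that the first job emits features "normalized by length of each document". As a rough illustration, the sketch below damps each word count with log(1 + count) and divides by the L2 norm of those values. Both the damping and the choice of norm are assumptions made for this example, and LengthNormalizer is a hypothetical name rather than part of Mahout.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of per-document length normalization. */
final class LengthNormalizer {

    /** Returns log(1 + count) for each distinct word, divided by the L2 norm of those values. */
    static Map<String, Double> normalize(List<String> words) {
        // Count how often each word occurs in this document.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        // L2 norm of the damped counts.
        double sumOfSquares = 0.0;
        for (int c : counts.values()) {
            double v = Math.log(1.0 + c);
            sumOfSquares += v * v;
        }
        double norm = Math.sqrt(sumOfSquares);
        // An empty document yields an empty map, so no division by zero occurs.
        Map<String, Double> weights = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            weights.put(e.getKey(), Math.log(1.0 + e.getValue()) / norm);
        }
        return weights;
    }

    public static void main(String[] args) {
        System.out.println(normalize(Arrays.asList("engine", "oil", "engine")));
    }
}
```

With this kind of normalization a long document and a short one contribute feature vectors of the same length, so document size alone does not dominate the later sums.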
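The second map/reduce class, BayesTfIdfDriver, computes the TF-IDF values from the output files of the first job. When it finishes, it deletes these three intermediate files and generates the file trainer-tfIdf, which stores the TF-IDF values of the text features.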
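For readers unfamiliar with TF-IDF, a minimal sketch of the textbook formula tf * log(N / df) follows. This is the standard definition, not necessarily the exact expression BayesTfIdfDriver evaluates; TfIdfSketch and its parameter names are hypothetical, and which of the three intermediate files supplies which quantity is not stated in the article.

```java
/** Hypothetical sketch of the textbook TF-IDF weight. */
final class TfIdfSketch {

    /**
     * @param tf           term weight of the word (for example a normalized frequency)
     * @param docCount     total number of documents, N
     * @param docsWithTerm number of documents containing the word, df
     */
    static double tfIdf(double tf, long docCount, long docsWithTerm) {
        if (docCount <= 0 || docsWithTerm <= 0 || docsWithTerm > docCount) {
            throw new IllegalArgumentException("need 0 < docsWithTerm <= docCount");
        }
        // A word that appears in every document gets log(1) = 0, that is, no weight.
        return tf * Math.log((double) docCount / docsWithTerm);
    }

    public static void main(String[] args) {
        System.out.println(tfIdf(0.5, 100, 10));
    }
}
```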
Third: BayesWeightSummerDriver
Output key: StringTuple.class
Output value: DoubleWritable.class
Input path: the trainer-tfIdf file generated by the second map/reduce job
Output: the trainer-weights file
Input file format: SequenceFileInputFormat.class
Output file format: BayesWeightSummerOutputFormat.class
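The weight-summing step described earlier ("sums of weights for each label, for each feature") can be pictured in memory as follows. This is only a single-machine sketch under the assumption that the TF-IDF values are available as a label-to-feature map; the real job works on SequenceFiles through map/reduce, and WeightSums is a hypothetical class.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory version of the per-label and per-feature weight sums. */
final class WeightSums {
    final Map<String, Double> perLabel = new TreeMap<>();
    final Map<String, Double> perFeature = new TreeMap<>();
    double total;

    /** Sums the weights by label, by feature, and overall. */
    static WeightSums of(Map<String, Map<String, Double>> tfIdfByLabel) {
        WeightSums s = new WeightSums();
        for (Map.Entry<String, Map<String, Double>> byLabel : tfIdfByLabel.entrySet()) {
            for (Map.Entry<String, Double> cell : byLabel.getValue().entrySet()) {
                double w = cell.getValue();
                s.perLabel.merge(byLabel.getKey(), w, Double::sum);
                s.perFeature.merge(cell.getKey(), w, Double::sum);
                s.total += w;
            }
        }
        return s;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> tfIdf = new HashMap<>();
        Map<String, Double> sports = new HashMap<>();
        sports.put("ball", 2.0);
        sports.put("goal", 1.0);
        tfIdf.put("sports", sports);
        WeightSums s = of(tfIdf);
        System.out.println(s.perLabel + " " + s.perFeature + " " + s.total);
    }
}
```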
Fourth job: BayesThetaNormalizerDriver
Output key: StringTuple.class
Output value: DoubleWritable.class
Input path: FileInputFormat.addInputPath(conf, new Path(output, "trainer-tfIdf/trainer-tfIdf")); that is, it uses the output of the second job, the trainer-tfIdf file
Output path: Path outPath = new Path(output, "trainer-thetaNormalizer");
This generates the file trainer-thetaNormalizer
Output file format: SequenceFileOutputFormat.class
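The article does not spell out how the per-label normalization factor is computed. One plausible reading, shown here purely as an assumption, is the sum over all features of the absolute smoothed log weight for that label. The smoothing parameter alpha, the formula itself, and the class ThetaNormalizerSketch are illustrative choices for this example, not taken from Mahout's source.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch: per-label sum of |log((w + alpha) / (labelSum + alpha * V))|. */
final class ThetaNormalizerSketch {

    static Map<String, Double> normalizers(Map<String, Map<String, Double>> weights, double alpha) {
        if (alpha <= 0.0) {
            throw new IllegalArgumentException("alpha must be positive");
        }
        // Vocabulary = every feature seen under any label.
        Set<String> vocab = new TreeSet<>();
        for (Map<String, Double> row : weights.values()) {
            vocab.addAll(row.keySet());
        }
        Map<String, Double> result = new TreeMap<>();
        for (Map.Entry<String, Map<String, Double>> e : weights.entrySet()) {
            Map<String, Double> row = e.getValue();
            double labelSum = 0.0;
            for (double w : row.values()) {
                labelSum += w;
            }
            double denominator = labelSum + alpha * vocab.size();
            double theta = 0.0;
            for (String feature : vocab) {
                // Features missing under this label count as weight 0 before smoothing.
                double w = row.getOrDefault(feature, 0.0);
                theta += Math.abs(Math.log((w + alpha) / denominator));
            }
            result.put(e.getKey(), theta);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> weights = new HashMap<>();
        Map<String, Double> row = new HashMap<>();
        row.put("ball", 1.0);
        row.put("goal", 1.0);
        weights.put("sports", row);
        System.out.println(normalizers(weights, 1.0));
    }
}
```

Dividing each label's log weights by such a factor keeps labels with many training documents from dominating the classification score.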
Once these four jobs have finished, the whole Bayes model has been built. In total, three directories are generated and saved:
trainer-tfIdf
trainer-thetaNormalizer
trainer-weights
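With the model built, the next step is to test how well the classifier performs.
Class to call: TestClassifier
Package: org.apache.mahout.classifier.bayes
Depending on the command-line arguments, it runs either sequentially or in parallel as map/reduce.
Parallel execution calls the BayesClassifierDriver class.
The BayesClassifierDriver class is analyzed below.
Package: org.apache.mahout.classifier.bayes.mapreduce.bayes
Input format: KeyValueTextInputFormat.class
Output format: SequenceFileOutputFormat.class
When execution finishes, it uses the confusion matrix (ConfusionMatrix) to display the results.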
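As a reminder of what a confusion matrix records, here is a small self-contained sketch that counts (actual label, predicted label) pairs and derives the accuracy. It is a hypothetical stand-in written for this article, not Mahout's ConfusionMatrix class, whose API is not shown here.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical minimal confusion matrix: rows are actual labels, columns are predicted labels. */
final class ConfusionMatrixSketch {
    private final Map<String, Map<String, Integer>> counts = new TreeMap<>();
    private int total;
    private int correct;

    /** Records one classified document. */
    void add(String actual, String predicted) {
        counts.computeIfAbsent(actual, k -> new TreeMap<String, Integer>())
              .merge(predicted, 1, Integer::sum);
        total++;
        if (actual.equals(predicted)) {
            correct++;
        }
    }

    /** Number of documents with the given actual label that were assigned the given predicted label. */
    int count(String actual, String predicted) {
        Map<String, Integer> row = counts.get(actual);
        if (row == null) {
            return 0;
        }
        Integer c = row.get(predicted);
        return c == null ? 0 : c;
    }

    /** Fraction of documents on the diagonal; 0.0 when nothing has been recorded. */
    double accuracy() {
        return total == 0 ? 0.0 : (double) correct / total;
    }

    public static void main(String[] args) {
        ConfusionMatrixSketch m = new ConfusionMatrixSketch();
        m.add("sports", "sports");
        m.add("sports", "politics");
        m.add("politics", "politics");
        System.out.println("accuracy = " + m.accuracy());
    }
}
```

Off-diagonal cells show which labels the classifier confuses with each other, which a single accuracy figure hides.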