书面作业那边修改不了,就发个帖子把作业做了吧我用的是Hadoop2.4.0+Mahout0.9+Pig0.13.0,所以课件中的方法都不适用了,要全部重新搞
首先是用sport构造bayes和cbayes分类器,这步可以直接参照examples/binclassify-20newsgroups.sh通过Mahout脚本实现
复制代码
构造好分类器,接下来就是如何用它去进行分类,这一步浪费了好多时间搜资料、看源代码,中间也遇到了很多问题,就不细说了,最终在老师提供的源码基础上做了些修改终于整出一个结果,当然还有许多需要优化的地方,欢迎大家共同讨论完善
1、先对user-sport序列化并进行分词处理
复制代码
2、修改
ClassifierDriver类,可参照mahout0.9源码中core里的
org.apache.mahout.classifier.naivebayes.test.TestNaiveBayesDriver
这步本来想像mahout一样通过-i、-o这种方式传参数进去的,但是 getOption的时候总是报空指针异常,也没找到是哪的问题,所以直接把参数都写死在程序里了
复制代码
3、修改Mapper,调用模型分类,并且按用户、类别进行计数
复制代码
4、Reducer部分没有修改,输出格式跟讲课内容一致,用户ID|类别|次数,部分输出结果如下
复制代码
5、接下来就没什么难度了,用Pig计算每个用户浏览各类文章的占比
复制代码
部分输出结果
复制代码
首先是用sport构造bayes和cbayes分类器,这步可以直接参照examples/binclassify-20newsgroups.sh通过Mahout脚本实现
- ##1.将分类文章序列化
- mahout seqdirectory
- -i /bayes/sport
- -o /bayes/sport-seq -ow
-
- ##2.分词处理
- mahout seq2sparse
- -i /bayes/sport-seq
- -o /bayes/sport-vectors -lnorm -nv -wt tfidf
-
- ##3.拆分成训练集和测试集
- mahout split
- -i /bayes/sport-vectors/tfidf-vectors
- --trainingOutput /bayes/sport-train-vectors
- --testOutput /bayes/sport-test-vectors
- --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
-
- ##4.训练bayes分类器
- mahout trainnb
- -i /bayes/sport-train-vectors -el
- -o /bayes/model
- -li /bayes/labelindex
- -ow
-
- ##5.测试bayes分类器
- mahout testnb
- -i /bayes/sport-test-vectors
- -m /bayes/model
- -l /bayes/labelindex
- -ow -o /bayes/sport-testing
-
- /*
- =======================================================
- Summary
- -------------------------------------------------------
- Correctly Classified Instances : 3824 95.5761%
- Incorrectly Classified Instances : 177 4.4239%
- Total Classified Instances : 4001
-
- =======================================================
- Confusion Matrix
- -------------------------------------------------------
- a b c d e f g h i j <--Classified as
- 372 0 0 0 0 3 2 2 0 0 | 379 a = badminton
- 3 376 4 3 6 11 1 4 1 2 | 411 b = basketball
- 16 0 428 1 2 15 1 1 6 2 | 472 c = billiards
- 1 0 1 390 0 7 0 0 1 0 | 400 d = f1
- 0 0 0 0 405 2 1 0 3 0 | 411 e = football
- 2 0 3 2 3 350 0 0 7 0 | 367 f = golf
- 0 0 0 0 1 1 381 1 0 0 | 384 g = pingpong
- 4 0 0 1 0 5 0 372 0 1 | 383 h = swim
- 7 1 1 2 0 12 3 7 372 1 | 406 i = tennis
- 1 1 0 0 1 1 0 2 4 378 | 388 j = volleyball
-
- =======================================================
- Statistics
- -------------------------------------------------------
- Kappa 0.9346
- Accuracy 95.5761%
- Reliability 87.0107%
- Reliability (standard deviation) 0.2902
- */
-
- ##6.训练cbayes分类器
- mahout trainnb
- -i /bayes/sport-train-vectors -el
- -o /bayes/model-cbayes
- -li /bayes/labelindex
- -ow -c
-
- ##7.测试cbayes分类器
- mahout testnb
- -i /bayes/sport-test-vectors
- -m /bayes/model-cbayes
- -l /bayes/labelindex
- -ow -o /bayes/sport-c-testing
- -c
-
- /*
- =======================================================
- Summary
- -------------------------------------------------------
- Correctly Classified Instances : 3815 95.3512%
- Incorrectly Classified Instances : 186 4.6488%
- Total Classified Instances : 4001
-
- =======================================================
- Confusion Matrix
- -------------------------------------------------------
- a b c d e f g h i j <--Classified as
- 375 0 0 1 0 1 0 2 0 0 | 379 a = badminton
- 0 387 1 6 10 4 1 2 0 0 | 411 b = basketball
- 10 1 435 0 4 13 2 0 6 1 | 472 c = billiards
- 1 0 1 394 1 3 0 0 0 0 | 400 d = f1
- 0 0 1 2 406 1 0 0 0 1 | 411 e = football
- 1 0 1 4 10 348 0 0 3 0 | 367 f = golf
- 0 0 0 0 1 0 382 1 0 0 | 384 g = pingpong
- 1 0 0 1 0 3 0 378 0 0 | 383 h = swim
- 6 3 5 7 11 17 2 6 349 0 | 406 i = tennis
- 2 3 0 3 6 2 5 2 4 361 | 388 j = volleyball
-
- =======================================================
- Statistics
- -------------------------------------------------------
- Kappa 0.9329
- Accuracy 95.3512%
- Reliability 86.7771%
- Reliability (standard deviation) 0.2907
- */
构造好分类器,接下来就是如何用它去进行分类,这一步浪费了好多时间搜资料、看源代码,中间也遇到了很多问题,就不细说了,最终在老师提供的源码基础上做了些修改终于整出一个结果,当然还有许多需要优化的地方,欢迎大家共同讨论完善
1、先对user-sport序列化并进行分词处理
- mahout seqdirectory
- -i /bayes/user-sport
- -o /bayes/us-seq -ow
-
- mahout seq2sparse
- -i /bayes/us-seq
- -o /bayes/us-vectors -lnorm -nv -wt tfidf
这步本来想像mahout一样通过-i、-o这种方式传参数进去的,但是 getOption的时候总是报空指针异常,也没找到是哪的问题,所以直接把参数都写死在程序里了
- public class ClassifierDriver extends AbstractJob {
-
- public static void main(String[] args) throws Exception {
- ToolRunner.run(new Configuration(), new ClassifierDriver(),args);
- }
-
- @Override
- public int run(String[] args) throws Exception {
- // set bayes parameter
- addInputOption();
- addOutputOption();
- addOption("model", "m", "The path to the model built during training",
- true);
- addOption("labelIndex","labelIndex", "The file where the index store ");
- addOption("labelNumber", "ln", "The labels number ");
-
- //Path input = getInputPath();
- //Path output = getOutputPath();
- Path input = new Path("/bayes/us-vectors/tfidf-vectors");//分词后的user-sport
- Path output = new Path("/bayes/user-output");
- //String labelNumber = getOption("labelNumber");
- String labelNumber = "10";//分类个数
- //String modelPath = getOption("model");
- String modelPath = "/bayes/model-cbayes";//贝叶斯分类器模型
- String labelIndex = "/bayes/labelindex";
- Configuration conf = getConf();
- conf.set(WeightsMapper.class.getName() + ".numLabels", labelNumber);
- conf.set("labelIndex", labelIndex);
- HadoopUtil.cacheFiles(new Path(modelPath), conf);
- HadoopUtil.delete(conf, output);
- Job job = new Job(conf);
- job.setJobName("Use bayesian model to classify the input:"
- + input.getName());
- job.setJarByClass(ClassifierDriver.class);
-
- job.setInputFormatClass(SequenceFileInputFormat.class);
-
- //set Map&Reduce
- job.setMapperClass(ClassifierMapper.class);
- job.setReducerClass(ClassifierReducer.class);
-
- job.setOutputKeyClass(NullWritable.class);
- job.setOutputValueClass(Text.class);
- job.setMapOutputKeyClass(Text.class);
- job.setMapOutputValueClass(IntWritable.class);
-
- SequenceFileInputFormat.setInputPaths(job, input);
- SequenceFileOutputFormat.setOutputPath(job, output);
-
- if (job.waitForCompletion(true)) {
- return 0;
- }
- return -1;
- }
- }
- public class ClassifierMapper extends Mapper<Text, VectorWritable, Text, IntWritable> {
-
- private Text outKey = new Text();
- private static final IntWritable ONE = new IntWritable(1);
-
- private AbstractNaiveBayesClassifier classifier;
- private String labelIndex;
- private Map<Integer, String> labelMap;
- /**
- * Parallel Classification
- *
- * @param key
- * The label
- * @param value
- * the features (all unique) associated w/ this label
- * @param context
- */
- public void map(Text key, VectorWritable value, Context context) throws IOException,
- InterruptedException {
-
- String userID = key.toString().split("/")[1];
- Vector result = classifier.classifyFull(value.get());
- String label = BayesUtil.classifyVector(result, labelMap);
- // key is userID and label
- outKey.set(userID + "|" + label);
- context.write(outKey, ONE);
- }
-
- /**
- * read the model
- *
- * @throws IOException
- */
- @Override
- public void setup(Context context) throws IOException {
- System.out.println("Setup");
- Configuration conf = context.getConfiguration();
- Path modelPath = HadoopUtil.getSingleCachedFile(conf);
- labelIndex=conf.get("labelIndex");
- labelMap = BayesUtils.readLabelIndex(conf, new Path(labelIndex));
- NaiveBayesModel model = NaiveBayesModel.materialize(modelPath, conf);
- classifier = new StandardNaiveBayesClassifier(model);
- }
- 10511838|badminton|19
- 10511838|basketball|1
- 10511838|f1|10
- 10511838|football|4
- 10511838|golf|25
- 10511838|swim|1
- 10511838|tennis|1
- 10511838|volleyball|1
- 10564290|badminton|25
- 10564290|basketball|3
- 10564290|f1|5
- 10564290|football|2
- 10564290|golf|46
- 10564290|pingpong|5
- 10564290|tennis|1
- 10564290|volleyball|1
- --读入数据
- u_ct = load '/bayes/user-output' using PigStorage('|') as (user:chararray, category:chararray, times:int);
-
- --分用户计数
- u_count = foreach(group u_ct by user)
- {
- generate flatten(group) as user, SUM(u_ct.times) as sum;
- };
-
- --表连接
- u_join = join u_ct by user,u_count by user USING 'replicated';
-
- --计算百分比
- u_out = foreach u_join
- generate u_ct::user as user,u_ct::category as category,u_ct::times as times,
- (double)u_ct::times/(double)u_count::sum*100 as p;
-
- --输出结果
- store u_out into '/bayes/user-sport-out';
- 10511838 badminton 19 30.64516129032258
- 10511838 basketball 1 1.6129032258064515
- 10511838 f1 10 16.129032258064516
- 10511838 football 4 6.451612903225806
- 10511838 golf 25 40.32258064516129
- 10511838 swim 1 1.6129032258064515
- 10511838 tennis 1 1.6129032258064515
- 10511838 volleyball 1 1.6129032258064515
- 10564290 badminton 25 28.40909090909091
- 10564290 basketball 3 3.4090909090909087
- 10564290 f1 5 5.681818181818182
- 10564290 football 2 2.272727272727273
- 10564290 golf 46 52.27272727272727
- 10564290 pingpong 5 5.681818181818182
- 10564290 tennis 1 1.1363636363636365
- 10564290 volleyball 1 1.1363636363636365