Sentiment Classification with a Naive Bayes Classifier on Hadoop: A Practical Exercise

Latest recommended article published on 2023-01-08 18:01:05

丨人间有味是清欢

Category: Hadoop. Tags: hadoop, java, eclipse, linux, big data

Article link: https://blog.csdn.net/qq_40673864/article/details/103929677

1. Background materials

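1) Training set: training.txt

This file is a 75.8 MB text dataset containing 20,000,000 records. Each line has the form "label\t review text", where the label is the sentiment verdict for the review. The review text is a sequence of words separated by spaces; the words may be Chinese, English, or other special symbols. In other words, the text has the form "word1 word2 word3 word4 …… wordn", where wordi is the i-th word of the review and n is the total number of words it contains.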
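The line format described above can be illustrated with a small parsing sketch. The class name `TrainingLineDemo`, the label value `1`, and the sample words are assumptions made for illustration; the actual label values in training.txt are not stated in this article.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class TrainingLineDemo {
    /** One parsed record: the sentiment label and the space-separated words. */
    static final class Record {
        final String label;
        final List<String> words;

        Record(String label, List<String> words) {
            this.label = label;
            this.words = words;
        }
    }

    /** Splits "label\t review text" at the first tab; returns null if there is no tab. */
    static Record parse(String line) {
        String[] parts = line.split("\t", 2);
        if (parts.length < 2) {
            return null;
        }
        String body = parts[1].trim(); // tolerate a space after the tab
        List<String> words = body.isEmpty()
                ? Collections.<String>emptyList()
                : Arrays.asList(body.split(" +"));
        return new Record(parts[0].trim(), words);
    }

    public static void main(String[] args) {
        // Hypothetical sample line: label "1", then mixed Chinese, English and symbol tokens
        Record r = parse("1\t \u5f88\u597d good fast !");
        System.out.println(r.label + " -> " + r.words.size() + " words");
    }
}
```

Splitting with a limit of 2 keeps any further tab characters inside the review text rather than discarding them.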

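2) Test set: test.txt
The test.txt dataset contains 2,000 records. Each line holds only the review text, in the same format as the review text in training.txt described above.

3) Naming
This project was a course assignment, so the project is named NB_2017082040 and the HDFS storage path is hdfs://master:9000/input_2017082040/training.txt.

4) Environment
IDE: eclipse-jee-2019-12-R-linux-gtk-x86_64.tar.gz
Hadoop: Hadoop 2.7.7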

2. Implementation

2.1 Upload the file to HDFS

hadoop fs -put /root/Documents/training.txt hdfs://master:9000/input_2017082040/training.txt

2.2 Create the project

Create the Hadoop project in Eclipse using Maven (Maven setup is omitted here). After creating the Maven project, edit pom.xml as follows.

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>Hadoop</groupId>
  <artifactId>NB_2017082040</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>

  <name>NB_2017082040</name>
  <url>http://maven.apache.org</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>

  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>2.7.7</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-hdfs</artifactId>
      <version>2.7.7</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.7.7</version>
    </dependency>
  </dependencies>
</project>

2.3 Train the model

A MapReduce job cleans the data and counts words to build the training model.
Key code:

package Hadoop.NB_2017082040;

import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

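public class feelMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
	// Matches tokens made up entirely of Chinese characters (compiled once, reused for every record)
	private static final Pattern CHINESE = Pattern.compile("[\u4e00-\u9fa5]+");
	private final IntWritable one = new IntWritable(1);
	private final Text outKey = new Text();

	@Override
	public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
		// Split each line into the sentiment label and the review text
		String[] content = value.toString().split("\t", 2);
		if (content.length < 2) {
			return; // skip malformed lines that have no tab separator
		}
		// The sentiment label is content[0]
		String label = content[0];
		// The features of the current line are its space-separated words
		String[] features = content[1].split(" ");
		for (String feature : features) {
			// Data cleaning: keep only tokens that consist entirely of Chinese characters
			if (CHINESE.matcher(feature).matches()) {
				// Emit one count for this word under this label
				outKey.set(label + "_" + feature);
				context.write(outKey, one);
			}
		}
		// Emit one count for the label itself (one per record)
		outKey.set(label);
		context.write(outKey, one);
	}
}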

package Hadoop.NB_2017082040;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class feelReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
	// Sums the 1s emitted by the Mapper for each key (a label, or a label_word pair)
	@Override
	public void reduce(Text key, Iterable<IntWritable> values, Context context)
			throws IOException, InterruptedException {
		int count = 0;
		for (IntWritable val : values) {
			count += val.get();
		}
		context.write(key, new IntWritable(count));
	}
}
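To see what kind of model file the job produces, the Mapper and Reducer logic can be mimicked locally without a cluster. The following is a minimal sketch, not part of the original project: the class name `LocalCountDemo`, the labels `1` and `0`, and the sample lines are all assumptions for illustration.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

class LocalCountDemo {
    // Same cleaning rule as the Mapper: keep tokens made only of Chinese characters
    private static final Pattern CHINESE = Pattern.compile("[\u4e00-\u9fa5]+");

    /** Counts "label_word" pairs and labels, as the map and reduce steps do together. */
    static Map<String, Integer> count(String[] lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.split("\t", 2);
            if (parts.length < 2) {
                continue; // skip lines without a tab separator
            }
            String label = parts[0];
            for (String word : parts[1].split(" ")) {
                if (CHINESE.matcher(word).matches()) {
                    counts.merge(label + "_" + word, 1, Integer::sum);
                }
            }
            counts.merge(label, 1, Integer::sum); // one count per record
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical records; the Chinese words are written as Unicode escapes
        String[] lines = {
            "1\t\u5f88 \u597d good",
            "1\t\u597d !",
            "0\t\u5dee"
        };
        // Prints one "key<TAB>count" line per key, the same shape as the job output
        count(lines).forEach((k, v) -> System.out.println(k + "\t" + v));
    }
}
```

Keys without an underscore are per-label record counts, and keys of the form label_word are per-class word counts. The prediction code in section 2.4 relies on exactly this distinction when it reads the model file.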

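2.4 Naive Bayes implementation

Naive Bayes is implemented with the multinomial model, using Laplace smoothing to reduce the error caused by words that never occur in a class.

In the multinomial model, let a document be d=(t1,t2,…,tk), where each tk is a word occurring in the document and repeats are allowed. Then

Prior probability P(c) = total number of words in class c / total number of words in the whole training sample

Class-conditional probability P(tk|c) = (total occurrences of word tk across the documents of class c + 1) / (total number of words in class c + |V|)

V is the vocabulary of the training sample (each distinct word is counted once, however often it occurs), and |V| is the number of distinct words in the training sample. In m-estimate terms, this corresponds to m=|V| and p=1/|V|.

P(tk|c) can be read as how much evidence the word tk provides that d belongs to class c, while P(c) can be read as the overall share of class c (how likely the class is a priori).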
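Written out in symbols, the two estimates and the resulting decision rule look as follows. The notation is an assumption introduced here for compactness: N_c is the total number of words in class c, N is the total number of words in the training sample, and count(t_k, c) is the number of occurrences of t_k in class c.

```latex
P(c) = \frac{N_c}{N},
\qquad
P(t_k \mid c) = \frac{\operatorname{count}(t_k, c) + 1}{N_c + |V|}

\hat{c} = \arg\max_{c} \; P(c) \prod_{k} P(t_k \mid c)
        = \arg\max_{c} \left[ \log P(c) + \sum_{k} \log P(t_k \mid c) \right]
```

The second form of the decision rule is equivalent because the logarithm is monotonic, and it is the form that is practical to compute for long documents.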

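Take a set of labelled training documents with the following per-class word counts:

Class yes (8 words): Chinese ×5, Beijing ×1, Shanghai ×1, Macao ×1
Class no (3 words): Chinese ×1, Tokyo ×1, Japan ×1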

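Given a new sample, Chinese Chinese Chinese Tokyo Japan, classify it.

As an attribute vector the text is d=(Chinese, Chinese, Chinese, Tokyo, Japan), and the set of classes is Y={yes, no}.

Class yes contains 8 words in total and class no contains 3, so the training sample has 11 words and P(yes)=8/11, P(no)=3/11. The class-conditional probabilities are computed as follows: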

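P(Chinese | yes) = (5+1)/(8+6) = 6/14 = 3/7
// (occurrences of Chinese in class yes + 1) / (total words in class yes (8) + distinct words in the training sample (6))

P(Japan | yes) = P(Tokyo | yes) = (0+1)/(8+6) = 1/14

P(Chinese | no) = (1+1)/(3+6) = 2/9

P(Japan | no) = P(Tokyo | no) = (1+1)/(3+6) = 2/9

In the denominators, 8 is the length of text_c for class yes, that is, the total number of words in the yes training documents; 6 is the vocabulary size, since the training sample contains the six distinct words Chinese, Beijing, Shanghai, Macao, Tokyo and Japan; and 3 is the total number of words in class no.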

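With these class-conditional probabilities, the posterior scores for d = Chinese Chinese Chinese Tokyo Japan are:

P(yes | d) = (3/7)^3 × 1/14 × 1/14 × 8/11 = 54/184877 ≈ 0.00029209

P(no | d) = (2/9)^3 × 2/9 × 2/9 × 3/11 = 32/216513 ≈ 0.00014780

Since P(yes | d) > P(no | d), the document is assigned to class yes.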
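The arithmetic of this worked example can be checked with a few lines of Java. This is a sketch for verification only: the class name `WorkedExampleDemo` and the `score` helper are illustrative, and the per-class counts are the ones implied by the figures in the text (8 words in yes, 3 in no, 6 distinct words).

```java
import java.util.HashMap;
import java.util.Map;

class WorkedExampleDemo {
    // Per-class word counts implied by the figures in the text
    static final Map<String, Integer> YES = new HashMap<>();
    static final Map<String, Integer> NO = new HashMap<>();
    static {
        YES.put("Chinese", 5);
        YES.put("Beijing", 1);
        YES.put("Shanghai", 1);
        YES.put("Macao", 1);
        NO.put("Chinese", 1);
        NO.put("Tokyo", 1);
        NO.put("Japan", 1);
    }

    /** Unnormalised posterior: prior times the product of Laplace-smoothed likelihoods. */
    static double score(Map<String, Integer> counts, int classTotal, int allTotal,
                        int vocabSize, String[] doc) {
        double p = (double) classTotal / allTotal; // word-count prior, as in the text
        for (String w : doc) {
            p *= (counts.getOrDefault(w, 0) + 1.0) / (classTotal + vocabSize);
        }
        return p;
    }

    public static void main(String[] args) {
        String[] d = {"Chinese", "Chinese", "Chinese", "Tokyo", "Japan"};
        double yes = score(YES, 8, 11, 6, d);
        double no = score(NO, 3, 11, 6, d);
        System.out.printf("yes=%.8f no=%.8f%n", yes, no);
        System.out.println("predicted class: " + (yes > no ? "yes" : "no"));
    }
}
```

The two scores come out at roughly 0.000292 and 0.000148, so the yes class wins by about a factor of two.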

Key algorithm implementation

import java.io.BufferedReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Set;

	/**
	 * Prediction uses the multinomial model.
	 */
	private static String modelFilePath = "/root/Documents/2017082040_模型.txt";
	private static String testDataFilePath = "/root/Documents/test.txt";
	public static HashMap<String, Integer> parameters = null; // label -> number of training records with that label
	public static double Nd = 0.; // total number of training records
	public static HashMap<String, Integer> allFeatures = null; // "label_word" -> occurrence count
	public static HashMap<String, Double> labelFeatures = null; // label -> total word occurrences in that class
	public static HashSet<String> V = null; // vocabulary: distinct words in the training sample

	/**
	 * Post-processes the MapReduce output to build the multinomial model.
	 */
	public static void loadModel(String modelFile) throws Exception {
		if (parameters != null && allFeatures != null) {
			return; // model already loaded
		}
		parameters = new HashMap<String, Integer>();
		allFeatures = new HashMap<String, Integer>();
		labelFeatures = new HashMap<String, Double>();
		V = new HashSet<String>();
		Nd = 0.;
		// Read as UTF-8 explicitly, since the keys contain Chinese words;
		// try-with-resources closes the reader even if parsing fails
		try (BufferedReader br = Files.newBufferedReader(Paths.get(modelFile), StandardCharsets.UTF_8)) {
			String line;
			while ((line = br.readLine()) != null) {
				int tab = line.indexOf('\t');
				if (tab < 0) {
					continue; // skip lines that are not "key<TAB>count"
				}
				String feature = line.substring(0, tab);
				int count = Integer.parseInt(line.substring(tab + 1).trim());
				int sep = feature.indexOf('_');
				if (sep >= 0) {
					// "label_word" line: a per-class word count
					allFeatures.put(feature, count);
					String label = feature.substring(0, sep);
					labelFeatures.merge(label, (double) count, Double::sum);
					V.add(feature.substring(sep + 1));
				} else {
					// label-only line: a per-class record count
					parameters.put(feature, count);
					Nd += count;
				}
			}
		}
	}

	/**
	 * Predicts the label of a sentence by comparing the log posterior score of each class.
	 */
	public static String predict(String sentence, String modelFile) throws Exception {
		loadModel(modelFile);
		String predLabel = null;
		double maxValue = Double.NEGATIVE_INFINITY; // best log score so far
		String[] words = sentence.split(" ");
		Set<String> labelSet = parameters.keySet(); // the set of labels
		for (String label : labelSet) {
			// Log prior. Here the prior is estimated from record counts:
			// parameters.get(label) is the number of training records of class c,
			// and Nd is the total number of training records.
			double tempValue = Math.log(parameters.get(label) / Nd);
			// Denominator of the class-conditional probability: total words in class c + |V|.
			// A label with no retained words has a word total of 0.
			double denominator = labelFeatures.getOrDefault(label, 0.0) + V.size();
			for (String word : words) {
				// Multinomial model with Laplace smoothing:
				// P(tk|c) = (occurrences of tk in class c + 1) / (total words in class c + |V|).
				// A (label, word) pair never seen in training has count 0.
				int count = allFeatures.getOrDefault(label + "_" + word, 0);
				tempValue += Math.log((count + 1) / denominator);
			}
			if (tempValue > maxValue) {
				maxValue = tempValue;
				predLabel = label;
			}
		}
		return predLabel;
	}
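The predict method adds logarithms instead of multiplying probabilities. The short sketch below shows why that matters: multiplying many small factors underflows to zero in double precision, while the sum of their logarithms stays finite and comparable. The class name `LogSpaceDemo` and the chosen values (a probability of 1e-4 repeated 100 times) are assumptions for illustration only.

```java
class LogSpaceDemo {
    /** Multiplies p by itself n times, as a direct product of probabilities would. */
    static double product(double p, int n) {
        double result = 1.0;
        for (int i = 0; i < n; i++) {
            result *= p;
        }
        return result;
    }

    /** Adds log(p) n times, the log-space equivalent of the product above. */
    static double logSum(double p, int n) {
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += Math.log(p);
        }
        return sum;
    }

    public static void main(String[] args) {
        // The true product is 1e-400, far below the smallest positive double
        System.out.println("direct product: " + product(1e-4, 100));
        System.out.println("sum of logs:    " + logSum(1e-4, 100));
    }
}
```

Once every class score has underflowed to 0.0, the comparison between classes carries no information, which is why the log form is the usual choice for reviews of more than a few dozen words.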

Download the full project files: https://download.csdn.net/download/qq_40673864/12095749
