Java: Computing the Information Entropy of Text

Author: 一个叫欧维的程序员在此写博客

Category: Java. Tags: information entropy, text

Article link: https://blog.csdn.net/qq_41982466/article/details/99620227

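This post implements a method for computing the information entropy of a text (element) against a known corpus (content).
Example: we have collected all the text on a website (content) and want an entropy value for each title (element) in it.
Procedure:
First, segment and filter the corpus (content), count how often each word occurs, and use each word's share of the total word count as its probability.
Then, segment and filter the text (element) in the same way, look up the probability of each resulting word, and compute the entropy from those probabilities.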
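Written as a formula, the two steps amount to roughly the following. The symbols $c(w)$, $N$, $n$ and $w_i$ are notation introduced here for illustration and do not appear in the original post: $c(w)$ is the number of times word $w$ occurs in the filtered corpus, $N$ is the total number of words in the filtered corpus, and $w_1, \dots, w_n$ are the words of the filtered text.

```latex
p(w) = \frac{c(w)}{N},
\qquad
H(\text{element}) = -\frac{1}{n} \sum_{i=1}^{n} p(w_i) \, \log_2 p(w_i)
```

Words of the text that never occur in the corpus have no probability to look up; the code leaves them out of the sum but still counts them in $n$.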
The code is as follows:

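import java.util.HashMap;
import java.util.Map;

import org.ansj.domain.Result;
import org.ansj.domain.Term;

public static double getInfoEntropy(String content, String element) {

	// Segment and filter the corpus
	Result fliterContent = fliterString(content);

	// Count word frequencies in the corpus
	Map<String, Integer> map = new HashMap<String, Integer>();
	int sum = 0;
	for (Term term : fliterContent) {
		map.merge(term.getName(), 1, Integer::sum);
		sum++;
	}

	// Segment and filter the text
	Result fliterElement = fliterString(element);
	if (fliterElement.size() == 0) {
		// Nothing is left after filtering; return 0 rather than dividing by zero
		return 0.0;
	}

	// Compute the information entropy
	double infoentropy = 0.0;
	for (Term term : fliterElement) {
		Integer value = map.get(term.getName());
		if (value == null) {
			// The word does not occur in the corpus, so it contributes nothing
			continue;
		}
		// Frequency of this word across the whole corpus
		double frequency = (double) value / sum;
		infoentropy += -frequency * (Math.log(frequency) / Math.log(2));
	}
	// Average the per-word contributions over the words of the text
	return infoentropy / fliterElement.size();
}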
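To try the calculation without the ansj dependency, here is a minimal sketch that swaps word segmentation for plain whitespace splitting. `EntropySketch`, `tokenize` and `entropy` are hypothetical names used only for this illustration, and whitespace splitting is an assumption that will not work for unsegmented Chinese text; the arithmetic is meant to mirror the method above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for getInfoEntropy: whitespace splitting replaces ansj segmentation.
final class EntropySketch {

    // Assumption: tokens are separated by whitespace (no segmentation, no filtering).
    static String[] tokenize(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    static double entropy(String content, String element) {
        // Count how often each word occurs in the corpus.
        Map<String, Integer> counts = new HashMap<>();
        int total = 0;
        for (String word : tokenize(content)) {
            counts.merge(word, 1, Integer::sum);
            total++;
        }

        String[] words = tokenize(element);
        if (words.length == 0) {
            return 0.0; // nothing to score
        }

        // Sum -p * log2(p) over the words of the text that occur in the corpus.
        double sum = 0.0;
        for (String word : words) {
            Integer count = counts.get(word);
            if (count == null) {
                continue; // unseen word: no probability available
            }
            double p = (double) count / total;
            sum += -p * (Math.log(p) / Math.log(2));
        }
        // Average over all words of the text, including the skipped ones.
        return sum / words.length;
    }

    public static void main(String[] args) {
        System.out.println(entropy("a a b c", "a b"));
        System.out.println(entropy("a a b c", "a d"));
    }
}
```

With the corpus "a a b c", the probabilities are 0.5 for "a" and 0.25 each for "b" and "c". The text "a b" therefore scores (0.5 + 0.5) / 2, about 0.5, while "a d" scores about 0.25 because "d" is absent from the corpus yet still counts toward the length.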

The segmentation and filtering function is as follows:

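import org.ansj.domain.Result;
import org.ansj.recognition.impl.StopRecognition;
import org.ansj.splitWord.analysis.ToAnalysis;

public static Result fliterString(String content) {

	StopRecognition filter = new StopRecognition();

	filter.insertStopNatures("w"); // filter out punctuation
	filter.insertStopNatures("null"); // filter out whitespace
	filter.insertStopNatures("m"); // filter out numerals; this also drops Chinese number words such as “半数”
//	filter.insertStopWords("的"); // filter out a specific word
//	filter.insertStopNatures("w"); // filter out by part of speech
//	Filter by regular expression
//	filter.insertStopRegexes("\\.*");
//	filter.insertStopRegexes("\\[");
//	filter.insertStopRegexes("\\]");
//	filter.insertStopRegexes("\\-");
//	filter.insertStopRegexes("\\s*");

	// Return the segmented and filtered result
	return ToAnalysis.parse(content).recognition(filter);
}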
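The commented-out lines above hint at two more ways to drop tokens: by exact word and by regular expression. As a rough illustration of that idea only, the sketch below filters a list of plain strings. `StopFilterSketch` and its methods are hypothetical names, it is not ansj's implementation, and ansj's actual matching rules may differ; part-of-speech filtering is left out because plain strings carry no tags.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical illustration of stop-word and regex filtering; not ansj's StopRecognition.
final class StopFilterSketch {
    private final Set<String> stopWords = new HashSet<>();
    private final List<Pattern> stopPatterns = new ArrayList<>();

    void addStopWord(String word) {
        stopWords.add(word);
    }

    void addStopRegex(String regex) {
        stopPatterns.add(Pattern.compile(regex));
    }

    // Keep a token unless it is a stop word or some pattern matches it in full.
    List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (stopWords.contains(token) || matchesAnyPattern(token)) {
                continue;
            }
            kept.add(token);
        }
        return kept;
    }

    private boolean matchesAnyPattern(String token) {
        for (Pattern pattern : stopPatterns) {
            // Assumption: the whole token must match, not just part of it.
            if (pattern.matcher(token).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

For example, with the stop word "的" and the patterns `\\s*` and `\\d+`, the tokens "信息", "的", " ", "2021", "熵" would be reduced to "信息" and "熵".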

The Maven dependency for the segmentation library is as follows:

		<dependency>
			<groupId>org.ansj</groupId>
			<artifactId>ansj_seg</artifactId>
			<version>5.1.3</version>
		</dependency>
