软件工程-关于词频统计程序设计实现分析

需求:

写一个程序,读取一个30kb-300kb文本文件,统计其中词语的个数,并对其中高频词汇进行统计分析。

需求分析:

如果只简单实现程序功能,程序比较小,采用java语言,windows8开发环境,eclipse开发工具,VisualVM测试软件进行程序分析即可。由需求可知,程序需要的功能有,读取文件,统计单词,筛选出无意义的词语,排序,展示结果。

概要设计:

设计一个WordStatistics类,其中包含读取文件的功能,统计单词的功能,和去除无意义的词,排序,展示功能。在用一个测试类,直接对该类进行使用。

详细设计:

WordStatistic需要读取文本文件,在java中有字节流和字符流两种读取方式,我选择对其英语单词文本进行统计,所以选择字符流读取文本的方式。我采用了inputStream 进行读取,用bufferedReader进行包装,用JAVA中的StringTokenizer工具进行分词。用Map集合的key-value键值对容器装取分出的单

词,key代表单词,value为单词出现的次数。在装的同时,筛选出无意义的词语。最后用Array,按照value的值的大小进行排序。

键算法:

关键代码:

<span style="font-family:Microsoft YaHei;font-size:14px;"><span style="font-family:Microsoft YaHei;font-size:12px;">package com.coffee.statistics;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;

public class WordStatistics {
	private String path;
	private String words = "";
	private String partword;
	private HashMap<String, Integer> wordsmap = new HashMap<String, Integer>();

	public WordStatistics() {

	}

	public WordStatistics(String path) {
		this.path = path;
	}

	public void setPath(String path) {
		this.path = path;
	}

	public void statistics() {
		try {
			FileInputStream inputStream = new FileInputStream(new File(path));
			BufferedReader bufferedReader = new BufferedReader(
					new InputStreamReader(inputStream));
			while ((partword = bufferedReader.readLine()) != null) {
				words += partword;
			}
			StringTokenizer tokenizer = new StringTokenizer(words,
					"\n\r\t .,\"?!:'");
			System.out.println("单词数量" + tokenizer.countTokens());
			while (tokenizer.hasMoreElements()) {
				String key = tokenizer.nextToken();
				if (!isMeaning(key)) {
					key = tokenizer.nextToken();
				} else {
					int value = 1;
					// 添加筛选,将无效词汇去除
					if (wordsmap.get(key) != null) {
						value = (Integer) wordsmap.get(key).intValue();
						value++;
						wordsmap.put(key, value);
					} else {
						wordsmap.put(key, new Integer(value));

					}
				}
			}
			// 对集合根据Value值排序
			Set<?> sortset = wordsmap.entrySet();
			Map.Entry[] entrys = (Map.Entry[]) sortset
					.toArray(new Map.Entry[sortset.size()]);
			// 排序
			Arrays.sort(entrys, new Comparator<Object>() {

				public int compare(Object o1, Object o2) {
					Long key1 = Long.valueOf(((Map.Entry) o1).getValue()
							.toString());
					Long key2 = Long.valueOf(((Map.Entry) o2).getValue()
							.toString());
					return key1.compareTo(key2);
				}

			});

			// 遍历Map.Entry[]
			System.out.println("高频词汇前二十");
			for (int i = entrys.length - 1; i >= entrys.length - 20; i--) {
				System.out.println(entrys[i].getKey() + ":"
						+ entrys[i].getValue());
			}
			bufferedReader.close();
			inputStream.close();
		} catch (FileNotFoundException e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
		} catch (IOException e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
		}
		
	}

	// 判断读取的词语是否有意义
	private boolean isMeaning(String key) {
		String[] unmeaning = { "the", "I", "of", "and", "to", "gave", "So",
				"And", "was", "you", "our", "You", "your", "he", "He", "me",
				"we", "We", "their", "thiers", "not", "when", "over", "what",
				"then", "where", "there", "here", "who", "a", "an", "his",
				"in", "her", "if", "If", "Do", "do", "said", "did", "day",
				"took", "on", "or", "had", "for", "by", "at", "that", "she",
				"She", "as", "it", "they", "one", "them", "go", "because",
				"saw", "give", "him", "with", "into", "men", "t", "human",
				"be", "is", "are", "am", "were", "from", "came", "Then",
				"must", "told", "but", "no", "form", "up", "all", "so", "down",
				"after", "before", "off", "out", "got", "have", "may", "any",
				"asked", "himself", "my", "became", "herself", "back", "s",
				"been", "one", "two", "will", "would", "gave", "made", "make",
				"let", "Let" };
		if (key != null) {
			for (int i = 0; i < unmeaning.length; i++) {
				if (key.endsWith(unmeaning[i])) {
					return false;
				}
			}
			return true;
		}
		return false;
	}
}</span></span>


程序运行结果截图:



选取的文本内容艾是圣经的英文版

程序测试:

<span style="font-family:Microsoft YaHei;font-size:14px;">package com.coffee.statistics;
/**
 *单词统计测试类
 * @author apple
 *
 */
public class StatisticTest {
	public static void main(String[] args) {
		for(int i=0;i<100;i++){
		String path="C:/Users/coffee/Desktop/englishText.txt";
		WordStatistics statistics = new WordStatistics(path);
		statistics.statistics();
		}
	}
}
</span>
循环100次测试其结果:

CPU使用情况:  堆内存的使用: 

               


类: 线程:

           

内存使用:

线程表:


此程序只简单的实现了其功能,算法非常简单,运用到的数据结构也只是普通的Map和Array集合,对于词频统计,我还看到过关于树的一些应用,以后再继续对其深究。



评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值