[Big Data Basics] Counting word frequencies across all text files in a directory

Latest recommended article published on 2022-05-27 22:03:38

weixin_30933531

Tags: big data, shell

Original link: http://www.cnblogs.com/luoyesiqiu/p/9412729.html

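Approach

1. Set up a global table that stores every word seen so far together with its occurrence count.
2. Traverse all files and use the file type to decide whether a file should be read.
3. Read the file contents.
4. Split the contents into individual words and record each word and its count in the global table.
5. Sort the data by occurrence count in descending order.
6. Print the results.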

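Implementation

1. Set up a global table that stores every word seen so far together with its occurrence count

A TreeMap is used here. Note that a TreeMap keeps its keys (the words) in sorted order; the ordering by occurrence count is done separately in step 5.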

private static Map<String, Integer> table=new TreeMap<String, Integer>();
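As a rough illustration of how such a table behaves, the sketch below counts a handful of words with `Map.merge`. The class and method names (`CountTableSketch`, `count`) are made up for this example and are not part of the original project; the sample words are arbitrary.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only, not taken from the original project.
class CountTableSketch {
    // Count how often each word occurs; the TreeMap keeps the words in alphabetical order.
    static Map<String, Integer> count(List<String> words) {
        Map<String, Integer> table = new TreeMap<>();
        for (String word : words) {
            // Store 1 for a new word, otherwise add 1 to the stored count.
            table.merge(word, 1, Integer::sum);
        }
        return table;
    }

    public static void main(String[] args) {
        // Iteration follows key order, not count order (List.of needs Java 9+).
        System.out.println(count(List.of("int", "the", "if", "the"))); // {if=1, int=1, the=2}
    }
}
```

The output is ordered by word, which is why a separate sort by count is still needed before printing a ranking.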

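2. Traverse all files and use the file type to decide whether a file should be read

The directory is traversed recursively, and only files with the suffixes c, cpp, or java are read.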

    private static void statisticDir(File dir) {
        if(dir.isFile()) {
            return;
        }
        File[] fs=dir.listFiles();
        if(fs==null) {
            return;
        }
        for (File f:fs)
        {
            if (f.isFile())
            {
                String full=f.getAbsolutePath();
                if(full.endsWith(".c")||full.endsWith(".cpp")||full.endsWith(".java")) {
                    statisticFile(full);
                }
            }
            else{
                System.out.println("Scanning: "+f.getAbsolutePath());// comment this line out to speed things up
                statisticDir(f);
            }

        }

    }
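The same traversal can also be sketched with `java.nio.file.Files.walk`, which avoids hand-written recursion. This is an alternative shown under assumptions, not the author's code: the names `SourceFileFinder`, `isSourceFile`, and `find` are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical alternative to the recursive method above, not the author's code.
class SourceFileFinder {
    // Same suffix rule as the post: keep only .c, .cpp and .java files.
    static boolean isSourceFile(String name) {
        return name.endsWith(".c") || name.endsWith(".cpp") || name.endsWith(".java");
    }

    // Walk the tree below root and collect the matching regular files.
    static List<Path> find(Path root) throws IOException {
        // The stream holds open directory handles, so close it when done.
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .filter(p -> isSourceFile(p.getFileName().toString()))
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Use the first argument as the root directory, or the current directory.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        find(root).forEach(System.out::println);
    }
}
```

One difference worth noting: the recursive version silently skips directories it cannot list, whereas `Files.walk` may throw an `UncheckedIOException` when it hits an unreadable directory.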

3. Read the file contents

    private static StringBuilder readFile(String file) {
        StringBuilder stringBuilder=new StringBuilder();
        // try-with-resources closes the reader even if reading fails.
        // FileReader decodes the file with the default charset.
        try (BufferedReader bufferedReader=new BufferedReader(new FileReader(file))) {
            String line;
            while ((line=bufferedReader.readLine())!=null) {
                stringBuilder.append(line).append('\n');
            }
        } catch (IOException e) {
            // Report the failure and return whatever was read so far
            e.printStackTrace();
        }
        return stringBuilder;
    }

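4. Split the file contents into individual words and record each word and its count in the global table

Single-letter words are ignored, and identifiers written in upper or lower camel case are split into their component words. For example, void setName(name); or void SetName(Name); is broken down into 4 words.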

    //Record one occurrence of sentence[start,end); single-letter words are ignored.
    //`sensitive` (case-sensitive mode) is not declared in this excerpt; it is presumably a flag in the full source.
    private static void addWord(String sentence,int start,int end){
        if(end-start<=1) {
            return;
        }
        String word=sentence.substring(start, end);
        if(!sensitive){
            word=word.toLowerCase();
        }
        table.merge(word, 1, Integer::sum);
    }

    private static void statisticWordsBySentence(String sentence){
        int start=0;//start index of the word being scanned
        boolean scan=false;//true while inside a word
        int len=sentence.length();
        for(int i=0;i<len;i++) {
            char ch=sentence.charAt(i);

            //Lowercase letter: start a new word or extend the current one
            if(Character.isLowerCase(ch)) {
                if(!scan){
                    start=i;
                    scan=true;
                }
            }
            //Uppercase letter: a camel-case boundary, so close the current word and start a new one here
            else if(Character.isUpperCase(ch)) {
                if(scan){
                    addWord(sentence, start, i);
                }
                start=i;
                scan=true;
            }
            //Anything else ends the current word
            else{
                if(scan){
                    addWord(sentence, start, i);
                }
                scan=false;
            }
        }
        //Flush a word that runs up to the end of the input
        if(scan){
            addWord(sentence, start, len);
        }
    }
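The splitting rules described above (camel-case boundaries, no single-letter words) can also be approximated with a regular expression. The sketch below is a hypothetical variant, not the author's scanner: `CamelSplitSketch` and `split` are invented names, and the pattern only covers ASCII letters, whereas `Character.isLowerCase` and `Character.isUpperCase` also accept non-ASCII letters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical regex-based variant; the post itself uses a hand-written scanner.
class CamelSplitSketch {
    // An optional uppercase letter followed by one or more lowercase letters (ASCII only).
    private static final Pattern WORD = Pattern.compile("[A-Z]?[a-z]+");

    static List<String> split(String sentence) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(sentence);
        while (m.find()) {
            // Skip single-letter words, as the post does.
            if (m.group().length() > 1) {
                words.add(m.group());
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(split("void setName(name);")); // [void, set, Name, name]
        System.out.println(split("void SetName(Name);")); // [void, Set, Name, Name]
    }
}
```

Both sample inputs from the text yield four words. Runs of capitals such as URL produce no word at all under this rule, because each capital on its own counts as a single-letter word.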

5. Sort the data by occurrence count in descending order

    private static List<Map.Entry<String, Integer>> sortData() {
        List<Map.Entry<String, Integer>> entryArrayList = new ArrayList<>(table.entrySet());
        Collections.sort(entryArrayList, new Comparator<Map.Entry<String, Integer>>() {
            @Override
            public int compare(Map.Entry<String, Integer> v1, Map.Entry<String, Integer> v2) {
                // Descending by count; Integer.compare avoids the overflow that subtraction can cause
                return Integer.compare(v2.getValue(), v1.getValue());
            }
        });
        
        return entryArrayList;
    }
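A compact way to express the same ordering with lambdas is sketched below. The explicit tie-break on the word is an addition for this example, so that equal counts come out in a predictable order regardless of the map type; `SortSketch` and `sortByCountDesc` are hypothetical names and the sample counts are arbitrary.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only, not taken from the original project.
class SortSketch {
    static List<Map.Entry<String, Integer>> sortByCountDesc(Map<String, Integer> table) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(table.entrySet());
        // Higher counts first.
        Comparator<Map.Entry<String, Integer>> byCountDesc =
                (a, b) -> Integer.compare(b.getValue(), a.getValue());
        // Equal counts fall back to alphabetical order of the word.
        Comparator<Map.Entry<String, Integer>> byWord =
                (a, b) -> a.getKey().compareTo(b.getKey());
        entries.sort(byCountDesc.thenComparing(byWord));
        return entries;
    }

    public static void main(String[] args) {
        Map<String, Integer> table = new HashMap<>();
        table.put("if", 2);
        table.put("the", 5);
        table.put("int", 2);
        System.out.println(sortByCountDesc(table)); // [the=5, if=2, int=2]
    }
}
```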

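6. Print the results

The output is assembled as a Markdown table to make it easier to read. The top 100 entries are printed here; readers can adjust this as needed.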

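        System.out.println("| Rank | Word | Frequency |");
        System.out.println("| ------------- |:-------------:| --------:|");
        // list is assumed to be the sorted result of sortData(); stop early if it has fewer than 100 entries
        for(int i=0;i<Math.min(100, list.size());i++) {
            Map.Entry<String, Integer> entry=list.get(i);
            System.out.println("| "+(i+1)+" | "+entry.getKey()+" | "+entry.getValue()+" |");
        }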
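If the table should go somewhere other than standard output, building it as a string first may be more convenient. The following is a minimal sketch under that assumption; `MarkdownTableSketch`, `render`, and the `limit` parameter are hypothetical, and the rows in `main` are sample values rather than measured results.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only, not taken from the original project.
class MarkdownTableSketch {
    // Render at most `limit` rows of an already sorted list as a Markdown table.
    static String render(List<Map.Entry<String, Integer>> rows, int limit) {
        StringBuilder sb = new StringBuilder();
        sb.append("| Rank | Word | Frequency |\n");
        sb.append("| ---: | :--- | ---: |\n");
        // Never read past the end of the list.
        int n = Math.min(limit, rows.size());
        for (int i = 0; i < n; i++) {
            Map.Entry<String, Integer> e = rows.get(i);
            sb.append("| ").append(i + 1).append(" | ").append(e.getKey())
              .append(" | ").append(e.getValue()).append(" |\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> rows = new ArrayList<>();
        rows.add(new AbstractMap.SimpleEntry<>("the", 3)); // sample values only
        rows.add(new AbstractMap.SimpleEntry<>("if", 2));
        System.out.print(render(rows, 100));
    }
}
```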

Results

During testing, counting a Java source directory in case-sensitive mode produced the following results:

Rank	Word	Frequency
1	the	311620
2	if	160965
3	int	147354
4	to	124752
5	ud	122707
6	return	120929
7	is	103377
8	of	97253
9	public	82258
10	code	80901
11	get	80374
12	in	78338
13	this	72584
14	for	66639
15	void	66632
16	const	65662
17	String	61459
18	and	60536
19	static	58577
20	be	52238
21	new	52176
22	value	50750
23	set	48107
24	define	46341
25	or	44280
26	final	44272
27	The	43007
28	null	40870
29	param	39200
30	ua	35429
31	Exception	35177
32	not	33046
33	that	32959
34	with	31674
35	char	31605
36	private	30663
37	name	30092
38	by	28578
39	else	28552
40	on	27356
41	data	27117
42	link	26914
43	type	26831
44	length	26330
45	an	25982
46	License	25953
47	class	25941
48	android	24778
49	udc	24752
50	Code	24273
51	This	24032
52	ude	23355
53	key	22997
54	from	22842
55	are	22709
56	Object	22363
57	result	22353
58	Unicode	22143
59	import	22070
60	as	21689
61	it	21530
62	td	21388
63	size	21351
64	Array	20837
65	status	20546
66	file	20100
67	case	20058
68	udd	19954
69	Type	19706
70	index	19660
71	View	19528
72	use	19258
73	include	18985
74	Name	18622
75	object	18118
76	start	18109
77	boolean	17693
78	Value	17223
79	will	16990
80	out	16947
81	error	16763
82	Set	16560
83	may	16362
84	To	16344
85	string	16319
86	err	16081
87	true	15680
88	throws	15655
89	endif	15619
90	unsigned	15532
91	long	15313
92	udf	15279
93	at	15106
94	ctx	14984
95	State	14939
96	Info	14749
97	If	14680
98	block	14453
99	false	14434
100	used	14247

Full source code

https://github.com/luoyesiqiu/StatisticWords

Reposted from: https://www.cnblogs.com/luoyesiqiu/p/9412729.html
