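Google Python course basic exercise: wordcount

After learning Python's basic data types and related syntax, this exercise combines what was covered (list, tuple, str, dict, sorting, and files) to solve the following problem.

The problem statement and code are as follows: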

# -*- coding: cp936 -*-
"""
1. For the --count flag, implement a print_words(filename) function that counts
how often each word appears in the text and prints:
word1 count1
word2 count2
...

Print the above list in order sorted by word. Store all the words as lowercase,
so 'The' and 'the' count as the same word.

2. For the --topcount flag, implement a print_top(filename) which is similar
to print_words() but which prints just the top 20 most common words sorted
so the most common word is first, then the next most common, and so on.
"""
import sys


def word_count(filename):
    """Helper utility: return a dict of {word: count} for the file."""
    dict_1 = {}
    # f.read() returns the whole file as one string, lower() folds case so
    # 'The' and 'the' match, and split() splits on any whitespace, so text
    # is a list of all the words in the file.
    with open(filename, 'r') as f:
        text = f.read().lower().split()
    for word in text:  # tally each word into dict_1
        if word in dict_1:
            dict_1[word] += 1
        else:
            dict_1[word] = 1
    return dict_1


def print_words(filename):
    dict_2 = word_count(filename)
    # Common mistake: sorted() returns a list of keys, not a dict.
    for k in sorted(dict_2.keys()):
        # %-formatting keeps the output identical under Python 2 and 3.
        print('%s %s' % (k, dict_2[k]))


def print_top(filename):
    dict_2 = word_count(filename)

    def get_value(tup):
        return tup[-1]

    # items() gives (word, count) tuples; sort them by count, largest first.
    # Despite its name, dict_3 is a list of tuples.
    dict_3 = sorted(dict_2.items(), key=get_value, reverse=True)
    for k, v in dict_3[:20]:
        print('%s %s' % (k, v))


def main():
    # sys.argv[0] is the script itself, sys.argv[1] is the option flag,
    # and sys.argv[2] is the filename.
    if len(sys.argv) != 3:
        print('usage: ./wordcount.py {--count | --topcount} file')
        sys.exit(1)

    option = sys.argv[1]
    filename = sys.argv[2]
    if option == '--count':
        print_words(filename)
    elif option == '--topcount':
        print_top(filename)
    else:
        print('unknown option: ' + option)
        sys.exit(1)


if __name__ == '__main__':
    main()
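
As an alternative to the hand-written dict above, the same counting and ranking can be sketched with collections.Counter from the standard library. This is a minimal sketch assuming Python 3, not the exercise's intended solution: count_words, top_words, and the sample sentence are hypothetical names and data introduced only for illustration, and the sketch works on an in-memory string rather than a file.

```python
from collections import Counter


def count_words(text):
    """Return a Counter of lowercased, whitespace-separated words (hypothetical helper)."""
    return Counter(text.lower().split())


def top_words(counts, n=20):
    """Return the n most common (word, count) pairs, most common first."""
    return counts.most_common(n)


# Made-up sample text, used only to exercise the two helpers.
sample = "The cat and the hat and the bat"
counts = count_words(sample)

# Equivalent of --count: iterate over the words in sorted order.
for word in sorted(counts):
    print(word, counts[word])

# Equivalent of --topcount, limited to the top 2 for this tiny sample.
for word, count in top_words(counts, 2):
    print(word, count)
```

Counter.most_common(n) does the sort-by-count-descending and slicing that print_top() performs by hand with sorted(), key=, and reverse=True.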

Result screenshots:

(continued from the previous screenshot)

The original text of word.txt, pasted below:

Campaigners say the increasing sexualisation of society, 

fuelled by easy access to internet pornography, is behind 

the disturbing figures.

Only yesterday the National Union of Teachers warned that 

sexual equality has been ‘rebranded by big business’ into 

a ‘raunch culture’ which is damaging the way girls view 

themselves.

Playboy bunnies adorn children’s pencil cases, pole dancing 

is portrayed as an ‘empowering’ form of exercise and 

beauty pageants have become a staple of student life, 

delegates said.

Statistics from the Department for Education show that in 

2009/10, there were 3,330 exclusions for sexual misconduct. 

In 2010/11, a further 3,030 children were excluded for the 

same reason.

The 6,000-plus cases cover lewd behaviour, sexual abuse, 

assault, bullying, daubing sexual graffiti, and sexual 

harassment.

The 2010/11 total includes 200 exclusions from primary 

schools: 190 suspensions and ten expulsions.

There have been warnings that the number of expulsions may 

only hint at the true scale of the problem.

England’s deputy children’s commissioner has told MPs that 

head teachers are reluctant to tackle sexual exploitation of 

children for fear of the message it will send out about 

their schools. 

Sue Berelowitz said schools were not facing up to the fact 

that some bullying amounts to sexual violence.

PS:

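  1. In essence, the task is to compute word statistics for a given file and report the top 20 words, which is one of the most basic functions in web text retrieval.
  2. Problems 1 and 2 share a common step: getting the word/count pairs from the file. That is why the helper function word_count() is factored out. It covers reading the text, splitting it into words, and counting each word.
  3. Details: sorted(dict.keys()) returns a list, not a dict; dict.items() converts the dict into a list of (key, value) tuples.
  4. Room for improvement: comparing word.txt with the '--count' output shows that the word counting basically works. However, for items in the text such as 'rebranded by big business', 'empowering', 'raunch culture', and children's, split() keeps any punctuation attached to a word as part of that word, which distorts the counts and rankings of those words. The function should therefore also handle punctuation and other non-alphanumeric characters.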
    if word and not word[0].isalnum():
        word = word[1:]
    if word and not word[-1].isalnum():
        word = word[:-1]

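    Adding the statements above to word_count() is enough. Note that they preserve words with a symbol in the middle, such as children's and 2010/11, while cleaning up words like 'empowering.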
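
The clean-up described in item 4 could be folded into the counting loop roughly as follows. This is a sketch under stated assumptions rather than the original code: it assumes Python 3, the names strip_edges and count_clean_words are hypothetical, and unlike the two if-statements above it strips every leading and trailing non-alphanumeric character instead of just one on each side.

```python
def strip_edges(word):
    """Drop non-alphanumeric characters from both ends of word, keeping inner ones."""
    start, end = 0, len(word)
    while start < end and not word[start].isalnum():
        start += 1
    while end > start and not word[end - 1].isalnum():
        end -= 1
    return word[start:end]


def count_clean_words(text):
    """Count lowercased words after trimming punctuation from their edges."""
    counts = {}
    for raw in text.lower().split():
        word = strip_edges(raw)
        if not word:  # the token was pure punctuation, e.g. '--'
            continue
        counts[word] = counts.get(word, 0) + 1
    return counts


# Made-up sample reusing a few tokens from word.txt.
sample = "'empowering' form of exercise, children's cases in 2010/11. -- Empowering"
for word, count in sorted(count_clean_words(sample).items()):
    print(word, count)
```

Skipping tokens that become empty after trimming avoids counting stray punctuation as a word, and it also sidesteps the IndexError that indexing an empty string would raise.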

Reposted from: https://www.cnblogs.com/Emma437211/archive/2013/04/02/2995899.html
