I've recently been working through the Web Intelligence and Big Data course on Coursera. Last Friday the instructor assigned a homework that asks us to write a MapReduce program in Python.
The full description is as follows:
Programming Assignment for HW3
Homework 3 (Programming Assignment A)
Download data files bundled as a .zip file from hw3data.zip
Each file in this archive contains entries that look like:
journals/cl/SantoNR90:::Michele Di Santo::Libero Nigro::Wilma Russo:::Programmer-Defined Control Abstractions in Modula-2.
that represent bibliographic information about publications, formatted as follows:
paper-id:::author1::author2::…. ::authorN:::title
Your task is to compute how many times every term occurs across titles, for each author.
For example, for the author Alberto Pettorossi, the following terms occur in titles with the indicated cumulative frequencies (across all his papers): program:3, transformation:2, transforming:2, using:2, programs:2, and logic:2.
Remember that an author might have written multiple papers, which might be listed in multiple files. Further notice that ‘terms’ must exclude common stop-words, such as prepositions etc. For the purpose of this assignment, the stop-words that need to be omitted are listed in the script stopwords.py. In addition, single letter words, such as "a" can be ignored; also hyphens can be ignored (i.e. deleted). Lastly, periods, commas, etc. need to be ignored; in other words, only alphabets and numbers can be part of a title term: Thus, “program” and “program.” should both be counted as the term ‘program’, and "map-reduce" should be taken as 'map reduce'. Note: You do not need to do stemming, i.e. "algorithm" and "algorithms" can be treated as separate terms.
The assignment is to write a parallel map-reduce program for the above task using either octo.py, or mincemeat.py, each of which is a lightweight map-reduce implementation written in Python.
These are available from http://code.google.com/p/octopy/ and mincemeat.py-zipfile respectively.
I strongly recommend mincemeat.py, which is much faster than Octo.py, even though the latter was covered first in the lecture video as an example. Both are very similar. Once you have computed the output, i.e. the terms-frequencies per author, go attempt Homework 3, where you will be asked questions that can be simply answered using your computed output, such as the top terms that occur for some particular author.
Note: There is no need to submit the code; I assume you will experiment using octo.py to learn how to program using map-reduce. Of course, you can always write a serial program for the task at hand, but then you won’t learn anything about map-reduce.
Lastly, please note that octo.py is a rather inefficient implementation of map-reduce. Some of you might want to delve into the code to figure out exactly why. At the same time, this inefficiency is likely to amplify any errors you make in formulating the map and reduce functions for the task at hand. So if your code starts taking too long, say more than an hour to run, there is probably something wrong.
Clearly, a week or so would be nowhere near enough if we had to write all of this from scratch. In the assignment the instructor strongly recommends the mincemeat.py library; with it, all we really have to modify are the three parameters datasource, mapfn, and reducefn. In other words, the instructor is testing our ability to adapt existing code, which is why he says the assignment does not need to be handed in.
Now for the implementation, step by step:
1. Download and analyze the data files
Unzipping the archive reveals a large number of files, each of which contains entries exactly as the assignment describes:
books/bc/tanselCGSS93/Tuzhilin93:::Alexander Tuzhilin:::Applications of temporal Databases to Knowledge-based Simulations.
books/aw/kimL89/DiederichM89:::Jim Diederich::Jack Milton:::Objects, Messages, and Rules in Database Design.
.................................................
In each entry, the leading paper id can be ignored; what we need to extract is the author list and the title that follows it. Note that, as the instructor points out, there may be more than one author, so we must first split the line on ":::" and then split the second field on "::". Nor is the title to be treated as one unit: the assignment wants every individual word in it. The target output looks like this:
author: word-number, where number is how many times the word occurs across that author's titles. With the requirements clear, the next step is to modify the code.
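Before wiring anything into mincemeat, it helps to check that the per-line parsing works. Here is a minimal sketch of splitting one entry and normalizing a title word (the sample line comes from the assignment; the clean() helper and its name are my own):

line = 'journals/cl/SantoNR90:::Michele Di Santo::Libero Nigro::Wilma Russo:::Programmer-Defined Control Abstractions in Modula-2.'
paper_id, author_field, title = line.split(':::')
authors = author_field.split('::')   # ['Michele Di Santo', 'Libero Nigro', 'Wilma Russo']

def clean(term):
    # hyphens become spaces; all other non-alphanumeric characters are dropped
    term = term.replace('-', ' ')
    return ''.join(c for c in term if c.isalnum() or c == ' ').lower()

print clean('Programmer-Defined')    # programmer defined
print clean('Modula-2.')             # modula 2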
2. Modify example.py
The mincemeat archive we downloaded includes an example.py file, and all we need to do is follow its pattern. Here is example.py:
#!/usr/bin/env python
import mincemeat

data = ["Humpty Dumpty sat on a wall",
        "Humpty Dumpty had a great fall",
        "All the King's horses and all the King's men",
        "Couldn't put Humpty together again",
        ]

# The data source can be any dictionary-like object
datasource = dict(enumerate(data))

#need change
def mapfn(k, v):
    for w in v.split():
        yield w, 1

#need change
def reducefn(k, vs):
    result = sum(vs)
    return result

s = mincemeat.Server()
s.datasource = datasource #need change
s.mapfn = mapfn
s.reducefn = reducefn

results = s.run_server(password="changeme")
print results
Only mapfn(), reducefn(), and datasource need to change, where datasource is a dict. My first idea was to store author:word pairs in it directly, but then MapReduce would have nothing left to do: I would have already found all the author:word pairs on behalf of mapfn(), which is surely not what the instructor intends. What should actually be passed in is the file contents, leaving mapfn() to read each file's text and build the author:word pairs, while reducefn() counts how often each pair occurs. Once that was clear, the rewrite became easy. I borrowed from fellow student zdw12242's code here, for which I am grateful; my own earlier attempt was rather painful to look at.
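Conceptually, what the server does with these three pieces boils down to the following serial sketch (just for intuition; source, mapfn and reducefn are the ones defined in the full script below, and real mincemeat farms the map and reduce calls out to networked clients):

pairs = []
for k, v in source.items():              # map phase: one call per file
    pairs.extend(mapfn(k, v))

grouped = {}
for author, term in pairs:               # shuffle: group values by key
    grouped.setdefault(author, []).append(term)

results = dict((author, reducefn(author, terms))   # reduce phase
               for author, terms in grouped.items())

And here is the full program: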
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import glob
import mincemeat
import operator

# collect every data file under the unzipped hw3data directory
text_files = glob.glob('E:\\Web\\hw3data\\*')

def file_contents(file_name):
    f = open(file_name)
    try:
        return f.read()
    finally:
        f.close()

# datasource: file name -> file contents
source = dict((file_name, file_contents(file_name))
              for file_name in text_files)
# setup map and reduce functions
def mapfn(key, value):
    # the stop-words from stopwords.py, plus single letters and stray punctuation
    stop_words = ['all', 'herself', 'should', 'to', 'only', 'under', 'do', 'weve',
                  'very', 'cannot', 'werent', 'yourselves', 'him', 'did', 'these',
                  'she', 'havent', 'where', 'whens', 'up', 'are', 'further', 'what',
                  'heres', 'above', 'between', 'youll', 'we', 'here', 'hers', 'both',
                  'my', 'ill', 'against', 'arent', 'thats', 'from', 'would', 'been',
                  'whos', 'whom', 'themselves', 'until', 'more', 'an', 'those', 'me',
                  'myself', 'theyve', 'this', 'while', 'theirs', 'didnt', 'theres',
                  'ive', 'is', 'it', 'cant', 'itself', 'im', 'in', 'id', 'if', 'same',
                  'how', 'shouldnt', 'after', 'such', 'wheres', 'hows', 'off', 'i',
                  'youre', 'well', 'so', 'the', 'yours', 'being', 'over', 'isnt',
                  'through', 'during', 'hell', 'its', 'before', 'wed', 'had', 'lets',
                  'has', 'ought', 'then', 'them', 'they', 'not', 'nor', 'wont',
                  'theyre', 'each', 'shed', 'because', 'doing', 'some', 'shes',
                  'our', 'ourselves', 'out', 'for', 'does', 'be', 'by', 'on',
                  'about', 'wouldnt', 'of', 'could', 'youve', 'or', 'own', 'whats',
                  'dont', 'into', 'youd', 'yourself', 'down', 'doesnt', 'theyd',
                  'couldnt', 'your', 'her', 'hes', 'there', 'hed', 'their', 'too',
                  'was', 'himself', 'that', 'but', 'hadnt', 'shant', 'with', 'than',
                  'he', 'whys', 'below', 'were', 'and', 'his', 'wasnt', 'am', 'few',
                  'mustnt', 'as', 'shell', 'at', 'have', 'any', 'again', 'hasnt',
                  'theyll', 'no', 'when', 'other', 'which', 'you', 'who', 'most',
                  'ours', 'why', 'having', 'once', 'a', '-', '.', ',']
    # each line looks like paper-id:::author1::author2...:::title
    for line in value.splitlines():
        fields = line.split(':::')
        authors = fields[1].split('::')
        title = fields[2]
        for author in authors:
            for term in title.split():
                if term not in stop_words:
                    if term.isalnum():
                        yield author, term.lower()
                    elif len(term) > 1:
                        # drop punctuation, turn hyphens into spaces
                        temp = ''
                        for ichar in term:
                            if ichar.isalpha() or ichar.isdigit():
                                temp += ichar
                            elif ichar == '-':
                                temp += ' '
                        yield author, temp.lower()
def reducefn(key, value):
    # key is an author; value is the list of every term mapfn emitted for him/her
    terms = value
    result = {}
    for term in terms:
        if term in result:
            result[term] += 1
        else:
            result[term] = 1
    return result
# start the server
s = mincemeat.Server()
s.datasource = source
s.mapfn = mapfn
s.reducefn = reducefn

results = s.run_server(password="changeme")
#print results

# save results to a file
result_file = open('hw3_result.txt', 'w')
# note: this line has no effect; sorted() builds a new list, which is discarded (see the postscript)
sorted(results.iteritems(), key=operator.itemgetter(1))
for result in results:
    result_file.write(result + ' : ')
    for term in results[result]:
        result_file.write(term + ':' + str(results[result][term]) + '#')
    result_file.write('\r\n')
result_file.close()
3. Run it
Open two cmd windows: run your own script in one and mincemeat.py in the other; in effect, one acts as the server and the other as the client. Once the server has finished processing, the results appear.
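For reference, the two commands look something like this (hw3.py is whatever you named the server script above, and changeme must match the password passed to run_server):

REM window 1 - start the server (your script)
python hw3.py

REM window 2 - start a client and point it at the server
python mincemeat.py -p changeme localhost

You can start several clients, even on other machines (replace localhost with the server's IP address), and mincemeat will spread the map and reduce tasks among them.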
As for how it works under the hood, it should be simulating MapReduce in pure Python.
The output file looks like this:
José Cristóbal Riquelme Santos : evolutionary:1#numeric:2#selection:1#reglas:1#clasificacin:1#efficient:1#feature:1#induccin:1#mediante:1#discovering:1#soap:1#rules:1#de:3#evolutivo:1#oblicuas:1#association:1#algoritmo:1#algorithm:1#via:1#un:1#mtodo:1#attributes:1#
Larry L. Kinney : control:1#microprogrammed:1#testing:1#number:1#detection:1#registers:1#feedback:1#group:1#strategy:1#intrainverted:1#units:1#evaluation:1#method:1#linear:1#concurrent:1#probing:1#relating:1#chips:1#a:1#cyclic:1#shift:1#large:1#behavior:1#error:1#
Gianfranco Bilardi : operations:1#logp:1#characterization:1#fat trees:1#computation:1#its:1#functions:1#for:1#locality:1#temporal:1#memory:1#hierarchies:1#across:1#versus:1#monotone:1#bsp:1#broadcast:1#a:1#lower:1#crew pram:1#portability:1#of:1#bounds:1#time:1#associative:1#
Joseph C. Culberson : binary:1#search:1#extended:1#polygons:1#polygon:1#orthogonal:1#simple:1#abstract:1#uncertainty:1#searching:1#trees:1#number:1#minimum:1#orthogonally:1#updates:1#convex:1#effect:1#covering:1#the:1#
.....................................
4. Postscript
One small regret: my attempt to sort results by word frequency had no effect. I believe the reason is that the word:number pairs together make up a single value in the dict, and I did not yet know how to sort on one component of a dict's values. So while doing the week-3 quiz I had to strain my eyes to find the two largest counts.
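Looking back, the more direct culprit is that sorted() returns a new list rather than sorting anything in place, so the sorted(results.iteritems(), ...) line in the script above simply discards its own result; and in any case each author's counts live in a nested dict that has to be sorted separately. A minimal sketch of pulling the top two terms per author (using the operator module already imported above):

for author, counts in results.items():
    # counts is a dict like {'program': 3, 'logic': 2, ...}
    top_two = sorted(counts.items(), key=operator.itemgetter(1), reverse=True)[:2]
    print author, top_two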
Finally, I have to say that I've needed Python for quite a few things lately; it seems there is no escaping Python's grasp.
5. Addendum
Thanks to zdw12242 for this fix.
To sort each author's term frequencies, mapfn stays the same; reducefn and the main program become:
def reducefn(key, value):
    terms = value
    counts = {}
    for term in terms:
        if term in counts:
            counts[term] += 1
        else:
            counts[term] = 1
    # sort the counts
    items = counts.items()
    reverse_items = [[v[1], v[0]] for v in items]
    reverse_items.sort(reverse=True)
    result = []
    for i in reverse_items:
        result.append([i[1], i[0]])
    return result

# start the server
s = mincemeat.Server()
s.datasource = source
s.mapfn = mapfn
s.reducefn = reducefn

results = s.run_server(password="changeme")

# save results
result_file = open('hw3_result_sorted', 'w')
for result in results:
    result_file.write(result + ' : ')
    for term in results[result]:
        result_file.write(term[0] + ':' + str(term[1]) + ',')
    result_file.write('\n')
result_file.close()
Because dicts in Python are stored as hash tables, iterating over one yields its items in arbitrary order; that is why the dict is converted into a list and sorted before being written out.
First, convert the dict into a list:
[['creating', 2], ['lifelong', 1],['quality', 3],['learners', 5], ['assurance', 1]]
Then swap the two values inside each sub-list:
[[2, 'creating'], [1, 'lifelong'],[3, 'quality'],[5, 'learners'], [1, 'assurance']]
Sort by the first value of each sub-list, in descending order:
[[5, 'learners'], [3, 'quality'], [2, 'creating'], [1, 'lifelong'], [1, 'assurance']]
Finally, swap the values back into place:
[['learners', 5], ['quality', 3], ['creating', 2], ['lifelong', 1], ['assurance', 1]]
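The same result can be had in one step with sorted() and a key function, with no swapping back and forth; a sketch using the list above:

items = [['creating', 2], ['lifelong', 1], ['quality', 3], ['learners', 5], ['assurance', 1]]
# sort by count, descending; the stable sort keeps ties in their original order
print sorted(items, key=lambda kv: kv[1], reverse=True)
# [['learners', 5], ['quality', 3], ['creating', 2], ['lifelong', 1], ['assurance', 1]]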