Last week I finally finished (and passed) a data course on Coursera. The experiment I did on Thursday for the course assignment left a deep impression, so I think it is worth writing down.
Hadoop Platform and Application Framework
by University of California, San Diego
1. Streaming
I started with a simple streaming job using Hadoop's streaming.jar plus Python.
Background on Hadoop streaming is here (thanks):
http://dongxicheng.org/mapreduce/hadoop-streaming-programming/
Usage:
> hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-input /user/cloudera/input \
-output /user/cloudera/output_join \
-mapper /home/cloudera/join1_mapper.py \
-reducer /home/cloudera/join1_reducer.py
One caveat: in my environment (Hadoop 2.6.0) two extra options were required:
-file /home/...... -file /home/...... In other words, writing out the mapper and reducer paths alone is not enough; you must also pass each script with its own -file option so that streaming can ship the files and run correctly.
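For reference, the complete command that ran successfully in my environment looked like this (same paths as above, with the two -file options appended so streaming ships the scripts to the cluster):

> hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -input /user/cloudera/input \
    -output /user/cloudera/output_join \
    -mapper /home/cloudera/join1_mapper.py \
    -reducer /home/cloudera/join1_reducer.py \
    -file /home/cloudera/join1_mapper.py \
    -file /home/cloudera/join1_reducer.py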
The corresponding code to copy is below.
The code and text files below are:
join1_mapper.py
join1_reducer.py
join1_FileA.txt
join1_FileB.txt
make_join2data.py
make_data_join2.txt (a short command line script)
#!/usr/bin/env python
import sys
# --------------------------------------------------------------------------
# This mapper code will input a <date word, value> input file, and move the
#   date into the value field for output
#
# Note: this program is written in a simple style and does not take full
#   advantage of Python data structures, but I believe it is more readable
#
# Note: there is NO error checking of the input; it is assumed to be correct,
#   meaning no extra spaces, missing inputs or counts, etc.
#
# See https://docs.python.org/2/tutorial/index.html for Python tutorials
# --------------------------------------------------------------------------

for line in sys.stdin:
    line = line.strip()               # strip out carriage return
    key_value = line.split(",")       # split line into key and value, returns a list
    key_in = key_value[0].split(" ")  # key is first item in list
    value_in = key_value[1]           # value is 2nd item
    # print key_in
    if len(key_in) >= 2:              # if this entry has <date word> in key
        date = key_in[0]              # now get date from key field
        word = key_in[1]
        value_out = date + " " + value_in        # concatenate date, blank, and value_in
        print('%s\t%s' % (word, value_out))      # print a string, tab, and string
    else:                             # key is only <word>, so just pass it through
        print('%s\t%s' % (key_in[0], value_in))  # print a string, tab, and string

# Note that Hadoop expects a tab to separate key and value,
# but this program assumes the input file has a ',' separating key and value
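To see what the mapper does, you can run it by itself on one of the sample files shown later (a quick local check, no Hadoop needed):

> cat join1_FileB.txt | python join1_mapper.py
able	Jan-01 5
about	Feb-02 3
...
able	Dec-15 100

A FileB line such as 'Jan-01 able,5' becomes 'able<TAB>Jan-01 5' (the date moves into the value), while a FileA line such as 'able,991' passes through as 'able<TAB>991'.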
#!/usr/bin/env python
import sys
# --------------------------------------------------------------------------
# This reducer code will input a <word, value> input file, and join words together
# Note the input will come as a group of lines with the same word (i.e. the key)
# As it reads words it will hold on to the value field
#
# It keeps track of the current word and the previous word; when the word
#   changes it performs the 'join' on the set of held values by merely printing
#   out the word and values. In other words, there is no need to explicitly
#   match keys because Hadoop has already put them sequentially in the input
#
# At the end it performs the last join
#
# Note: there is NO error checking of the input; it is assumed to be correct,
#   meaning each word has correct and matching entries, no extra spaces, etc.
#
# See https://docs.python.org/2/tutorial/index.html for Python tutorials
#
# San Diego Supercomputer Center copyright
# --------------------------------------------------------------------------

prev_word = " "  # initialize previous word to blank string
# month abbreviations used to recognize date values
# (the course-supplied file omitted 'May' and 'Oct'; added here for completeness)
months = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']

dates_to_output = []     # an empty list to hold dates for a given word
day_cnts_to_output = []  # an empty list of day counts for a given word
# see https://docs.python.org/2/tutorial/datastructures.html for list details

line_cnt = 0  # count input lines

for line in sys.stdin:
    line = line.strip()           # strip out carriage return
    key_value = line.split('\t')  # split line into key and value, returns a list
    line_cnt = line_cnt + 1
    # note: for simple debugging, use print statements
    curr_word = key_value[0]      # key is first item in list, indexed by 0
    value_in = key_value[1]       # value is 2nd item

    # -----------------------------------------------------
    # Check if it's a new word and not the first line
    # (because for the first line the previous word is not applicable);
    # if so, print out the list of dates and counts
    # -----------------------------------------------------
    if curr_word != prev_word:
        # write out the join result, but not for the first input line
        if line_cnt > 1:
            for i in range(len(dates_to_output)):  # loop thru dates, indexes start at 0
                print('{0} {1} {2} {3}'.format(dates_to_output[i], prev_word,
                                               day_cnts_to_output[i], curr_word_total_cnt))
            # now reset lists
            dates_to_output = []
            day_cnts_to_output = []
        prev_word = curr_word  # set up previous word for the next set of input lines

    # ---------------------------------------------------------------
    # Whether or not the join result was written out, now process the
    # current word: determine if it's from file <word, total-count> or
    # <word, date day-count>, and build up the list of dates, day counts,
    # and the one total count
    # ---------------------------------------------------------------
    if value_in[0:3] in months:
        date_day = value_in.split()  # split the value field into a date and day-cnt
        # add to the lists of value fields we are building
        dates_to_output.append(date_day[0])
        day_cnts_to_output.append(date_day[1])
    else:
        curr_word_total_cnt = value_in  # the value field was just the total count

# ---------------------------------------------------------------
# now write out the LAST join result
# ---------------------------------------------------------------
for i in range(len(dates_to_output)):  # loop thru dates, indexes start at 0
    print('{0} {1} {2} {3}'.format(dates_to_output[i], prev_word,
                                   day_cnts_to_output[i], curr_word_total_cnt))
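Hadoop sorts the mapper's output by key before the reducer sees it; locally you can approximate the whole job with a pipeline, using sort to stand in for the shuffle:

> cat join1_FileA.txt join1_FileB.txt | python join1_mapper.py | sort | python join1_reducer.py

With the sample files below, each output line has the form <date word day-count total-count>, e.g. 'Jan-01 able 5 991' (the exact line order depends on sort's locale).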
join1_FileA.txt:
able,991
about,11
burger,15
actor,22

join1_FileB.txt:
Jan-01 able,5
Feb-02 about,3
Mar-03 about,8
Apr-04 able,13
Feb-22 actor,3
Feb-23 burger,5
Mar-08 burger,2
Dec-15 able,100
#!/usr/bin/env python
import sys
# --------------------------------------------------------------------------
# (make_join2data.py) Generate a random combination of titles and viewer
# counts, or channels. This is a simple version of a congruential generator;
# not a great random generator, but enough for this purpose.
# --------------------------------------------------------------------------
chans = ['ABC','DEF','CNO','NOX','YES','CAB','BAT','MAN','ZOO','XYZ','BOB']
sh1 = ['Hot','Almost','Hourly','PostModern','Baked','Dumb','Cold','Surreal','Loud']
sh2 = ['News','Show','Cooking','Sports','Games','Talking','Talking']
vwr = range(17, 1053)

chvnm = sys.argv[1]  # 1st arg: if it's 'n', generate channels; otherwise viewer numbers
lch = len(chans)
lsh1 = len(sh1)
lsh2 = len(sh2)
lvwr = len(vwr)
ci = 1
s1 = 2
s2 = 3
vwi = 4
ri = int(sys.argv[3])  # 3rd arg seeds the generator

for i in range(0, int(sys.argv[2])):  # 2nd arg is the number of lines to output
    if chvnm == 'n':  # no number: emit <show,channel>
        print('{0}_{1},{2}'.format(sh1[s1], sh2[s2], chans[ci]))
    else:             # emit <show,viewer-count>
        print('{0}_{1},{2}'.format(sh1[s1], sh2[s2], vwr[vwi]))
    ci = (5*ci + ri) % lch
    s1 = (4*s1 + ri) % lsh1
    s2 = (3*s1 + ri + i) % lsh2
    vwi = (2*vwi + ri + i) % lvwr
    if vwi == 4: vwi = 5
make_data_join2.txt (the contents; the fixed seeds make the data reproducible):
python make_join2data.py y 1000 13 > join2_gennumA.txt
python make_join2data.py y 2000 17 > join2_gennumB.txt
python make_join2data.py y 3000 19 > join2_gennumC.txt
python make_join2data.py n 100 23 > join2_genchanA.txt
python make_join2data.py n 200 19 > join2_genchanB.txt
python make_join2data.py n 300 37 > join2_genchanC.txt
Run the script:
> sh make_data_join2.txt
This generates the six files we will use as input. (My streaming job kept failing in the map phase and I have not tracked down the error yet, so everything that follows is done with Spark.)
Remember to upload these six files to HDFS.
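For example (assuming the files sit in your current local directory and an input directory in your HDFS home, matching the paths used below):

> hdfs dfs -mkdir -p input
> hdfs dfs -put join2_gennum?.txt join2_genchan?.txt input/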
2. Using Spark's join() for the experiment
First start Hadoop (from the Hadoop directory):
./sbin/start-all.sh
Then start Spark (from the Spark directory):
./sbin/start-all.sh
Then launch pyspark:
./bin/pyspark
The assignment asks for the total viewer count for channel BAT.
First, read the text files from HDFS (adjust the path to your own setup):
show_views_file = sc.textFile("input/join2_gennum?.txt")
You can check what was read:
show_views_file.take(2)
will return the first 2 elements of the dataset:
[u'Hourly_Sports,21', u'PostModern_Show,38']
Next, split each line into show and view count:
def split_show_views(line):
    key_value = line.split(",")
    show = key_value[0]
    views = key_value[1]
    return (show, views)
Then:
show_views = show_views_file.map(split_show_views)
You can inspect the result with show_views.collect(), which returns
(u'show', u'views'), ...
· Read the channel data:
show_channel_file = sc.textFile("input/join2_genchan?.txt")
Split each line into show and channel (the function gets its own name so it does not shadow the one above):
def split_show_channel(line):
    key_value = line.split(",")
    show = key_value[0]
    channel = key_value[1]
    return (show, channel)
>> show_channel = show_channel_file.map(split_show_channel)
· Join the two datasets show_views and show_channel with join():
joined_dataset = show_channel.join(show_views)
Afterwards you can inspect it with .collect(); the output looks like
(u'show', (u'channel', u'views')), ...
Note the structure here: the show sits at position [0] and the (channel, views) pair at position [1], so the channel is at [1][0] and the view count at [1][1].
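To make that indexing concrete, here is one joined record (made-up values) and how each piece is reached:

record = (u'Hourly_Sports', (u'BAT', u'21'))
record[0]     # u'Hourly_Sports' -- the show
record[1]     # (u'BAT', u'21')  -- the (channel, views) pair
record[1][0]  # u'BAT'           -- the channel
record[1][1]  # u'21'            -- the view count (still a string)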
· Sum the viewers for channel BAT
The idea is to extract the channel as the key, i.e. pull out the channel at [1][0] and the views at [1][1]:
def extract_channel_views(show_views_channel):
    key_value = show_views_channel  # (show, (channel, views))
    channel = key_value[1][0]
    views = int(key_value[1][1])
    return (channel, views)
>> channel_views = joined_dataset.map(extract_channel_views)
def sum_channel(a, b):
    return a + b
>> channel_views.reduceByKey(sum_channel).collect()
After this you can see the total viewer count for every channel.
Output:
[(u'BOB', 2591062), (u'NOX', 2583583), (u'CAB', 3940862), (u'ABC', 1115974), (u'MAN', 6566187), (u'BAT', 5099141), (u'XYZ', 5208016), (u'DEF', 8032799), (u'CNO', 3941177)]
From this you can read off the total number of viewers for channel BAT.
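(As an aside, reduceByKey(lambda a, b: a + b) would do the same job without defining sum_channel.)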
Assignment 2: find the viewer count for each show on channel ABC
The course originally wants you to write a Python mapper.py and reducer.py and produce the result via streaming, but something in my mapper must be wrong: the job starts throwing errors once the map phase reaches 33%.
Having just finished the Spark course, it struck me that this could be done with Spark too, and easily; sure enough, the result came out after a short while.
1. Reusing the earlier joined_dataset, I wrote extract_ABC_views along the same lines as extract_channel_views:
def extract_ABC_views(show_views_channel):
    key_value = show_views_channel  # (show, (channel, views))
    title = key_value[0]
    channel = key_value[1][0]
    views = int(key_value[1][1])
    if channel == "ABC":
        return (title, views)
    else:
        return (0, 0)
This solves the problem. A note on return (0, 0): when I inspected the output with collect(), the ABC records came back as expected, but the records for the other channels came back as None (the function originally had no else branch), and those None values broke the subsequent reduceByKey. So I mapped those cases to (0, 0) instead, which does not affect the result and still gives me what I want.
>>ABC_views = joined_dataset.map(extract_ABC_views)
>>ABC_views.reduceByKey(sum_channel).collect()
This gives the viewer count for each show on ABC.
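An alternative that avoids the (0, 0) placeholder altogether is to filter the joined records down to ABC before mapping; a minimal sketch under the same data layout:

ABC_views = (joined_dataset
             .filter(lambda rec: rec[1][0] == "ABC")       # keep only ABC records
             .map(lambda rec: (rec[0], int(rec[1][1]))))   # (show, views)
ABC_views.reduceByKey(sum_channel).collect()

filter() drops the non-ABC records up front, so no placeholder key ever reaches the reduce.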
More learning ahead...