Last week I finally finished (and passed) a data course on Coursera. The experiment I did on Thursday for the course assignment left a deep impression, so I think it is worth writing down.
Hadoop Platform and Application Framework
by University of California, San Diego
1. Streaming
I started with a simple streaming job using Hadoop's streaming.jar plus Python.
Background on Hadoop streaming is here (thanks):
http://dongxicheng.org/mapreduce/hadoop-streaming-programming/
Usage:
> hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-input /user/cloudera/input \
-output /user/cloudera/output_join \
-mapper /home/cloudera/join1_mapper.py \
-reducer /home/cloudera/join1_reducer.py
One caveat: in my environment (Hadoop 2.6.0) two extra options were required:
-file /home/...... -file /home/...... In other words, writing out the mapper and reducer paths alone is not enough; you must also pass each script with its own -file option so that streaming can ship the files and run correctly.
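For reference, the complete command that ran successfully in my environment looked like this (same paths as above, with the two -file options appended so streaming ships the scripts to the cluster):

> hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -input /user/cloudera/input \
    -output /user/cloudera/output_join \
    -mapper /home/cloudera/join1_mapper.py \
    -reducer /home/cloudera/join1_reducer.py \
    -file /home/cloudera/join1_mapper.py \
    -file /home/cloudera/join1_reducer.py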
The corresponding code to copy is below.
The code and text files below are:
join1_mapper.py
join1_reducer.py
join1_FileA.txt
join1_FileB.txt
make_join2data.py
make_data_join2.txt (a short command line script)
#!/usr/bin/env python
import sys
# --------------------------------------------------------------------------
# This mapper code will input a <date word, value> input file, and move the
#   date into the value field for output
#
# Note: this program is written in a simple style and does not take full
#   advantage of Python data structures, but I believe it is more readable
#
# Note: there is NO error checking of the input; it is assumed to be correct,
#   meaning no extra spaces, missing inputs or counts, etc.
#
# See https://docs.python.org/2/tutorial/index.html for Python tutorials
# --------------------------------------------------------------------------

for line in sys.stdin:
    line = line.strip()               # strip out carriage return
    key_value = line.split(",")       # split line into key and value, returns a list
    key_in = key_value[0].split(" ")  # key is first item in list
    value_in = key_value[1]           # value is 2nd item
    # print key_in
    if len(key_in) >= 2:              # if this entry has <date word> in key
        date = key_in[0]              # now get date from key field
        word = key_in[1]
        value_out = date + " " + value_in        # concatenate date, blank, and value_in
        print('%s\t%s' % (word, value_out))      # print a string, tab, and string
    else:                             # key is only <word>, so just pass it through
        print('%s\t%s' % (key_in[0], value_in))  # print a string, tab, and string

# Note that Hadoop expects a tab to separate key and value,
# but this program assumes the input file has a ',' separating key and value
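To see what the mapper does, you can run it by itself on one of the sample files shown later (a quick local check, no Hadoop needed):

> cat join1_FileB.txt | python join1_mapper.py
able	Jan-01 5
about	Feb-02 3
...
able	Dec-15 100

A FileB line such as 'Jan-01 able,5' becomes 'able<TAB>Jan-01 5' (the date moves into the value), while a FileA line such as 'able,991' passes through as 'able<TAB>991'.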
#!/usr/bin/env python
import sys
# --------------------------------------------------------------------------
# This reducer code will input a <word, value> input file, and join words together
# Note the input will come as a group of lines with the same word (i.e. the key)
# As it reads words it will hold on to the value field
#
# It keeps track of the current word and the previous word; when the word
#   changes it performs the 'join' on the set of held values by merely printing
#   out the word and values. In other words, there is no need to explicitly
#   match keys because Hadoop has already put them sequentially in the input
#
# At the end it performs the last join
#
# Note: there is NO error checking of the input; it is assumed to be correct,
#   meaning each word has correct and matching entries, no extra spaces, etc.
#
# See https://docs.python.org/2/tutorial/index.html for Python tutorials
#
# San Diego Supercomputer Center copyright
# --------------------------------------------------------------------------

prev_word = " "  # initialize previous word to blank string
# month abbreviations used to recognize date values
# (the course-supplied file omitted 'May' and 'Oct'; added here for completeness)
months = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']

dates_to_output = []     # an empty list to hold dates for a given word
day_cnts_to_output = []  # an empty list of day counts for a given word
# see https://docs.python.org/2/tutorial/datastructures.html for list details

line_cnt = 0  # count input lines

for line in sys.stdin:
    line = line.strip()           # strip out carriage return
    key_value = line.split('\t')  # split line into key and value, returns a list
    line_cnt = line_cnt + 1
    # note: for simple debugging, use print statements
    curr_word = key_value[0]      # key is first item in list, indexed by 0
    value_in = key_value[1]       # value is 2nd item

    # -----------------------------------------------------
    # Check if it's a new word and not the first line
    # (because for the first line the previous word is not applicable);
    # if so, print out the list of dates and counts
    # -----------------------------------------------------
    if curr_word != prev_word:
        # write out the join result, but not for the first input line
        if line_cnt > 1:
            for i in range(len(dates_to_output)):  # loop thru dates, indexes start at 0
                print('{0} {1} {2} {3}'.format(dates_to_output[i], prev_word,
                                               day_cnts_to_output[i], curr_word_total_cnt))
            # now reset lists
            dates_to_output = []
            day_cnts_to_output = []
        prev_word = curr_word  # set up previous word for the next set of input lines

    # ---------------------------------------------------------------
    # Whether or not the join result was written out, now process the
    # current word: determine if it's from file <word, total-count> or
    # <word, date day-count>, and build up the list of dates, day counts,
    # and the one total count
    # ---------------------------------------------------------------
    if value_in[0:3] in months:
        date_day = value_in.split()  # split the value field into a date and day-cnt
        # add to the lists of value fields we are building
        dates_to_output.append(date_day[0])
        day_cnts_to_output.append(date_day[1])
    else:
        curr_word_total_cnt = value_in  # the value field was just the total count

# ---------------------------------------------------------------
# now write out the LAST join result
# ---------------------------------------------------------------
for i in range(len(dates_to_output)):  # loop thru dates, indexes start at 0
    print('{0} {1} {2} {3}'.format(dates_to_output[i], prev_word,
                                   day_cnts_to_output[i], curr_word_total_cnt))
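Hadoop sorts the mapper's output by key before the reducer sees it; locally you can approximate the whole job with a pipeline, using sort to stand in for the shuffle:

> cat join1_FileA.txt join1_FileB.txt | python join1_mapper.py | sort | python join1_reducer.py

With the sample files below, each output line has the form <date word day-count total-count>, e.g. 'Jan-01 able 5 991' (the exact line order depends on sort's locale).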
join1_FileA.txt:
able,991
about,11
burger,15
actor,22

join1_FileB.txt:
Jan-01 able,5
Feb-02 about,3
Mar-03 about,8
Apr-04 able,13
Feb-22 actor,3
Feb-23 burger,5
Mar-08 burger,2
Dec-15 able,100
#!/usr/bin/env python
import sys
# --------------------------------------------------------------------------
# (make_join2data.py) Generate a random combination of titles and viewer
# counts, or channels. This is a simple version of a congruential generator;
# not a great random generator, but enough for this purpose.
# --------------------------------------------------------------------------
chans = ['ABC','DEF','CNO','NOX','YES','CAB','BAT','MAN','ZOO','XYZ','BOB']
sh1 = ['Hot','Almost','Hourly','PostModern','Baked','Dumb','Cold','Surreal','Loud']
sh2 = ['News','Show','Cooking','Sports','Games','Talking','Talking']
vwr = range(17, 1053)

chvnm = sys.argv[1]  # 1st arg: if it's 'n', generate channels; otherwise viewer numbers
lch = len(chans)
lsh1 = len(sh1)
lsh2 = len(sh2)
lvwr = len(vwr)
ci = 1
s1 = 2
s2 = 3
vwi = 4
ri = int(sys.argv[3])  # 3rd arg seeds the generator

for i in range(0, int(sys.argv[2])):  # 2nd arg is the number of lines to output
    if chvnm == 'n':  # no number: emit <show,channel>
        print('{0}_{1},{2}'.format(sh1[s1], sh2[s2], chans[ci]))
    else:             # emit <show,viewer-count>
        print('{0}_{1},{2}'.format(sh1[s1], sh2[s2], vwr[vwi]))
    ci = (5*ci + ri) % lch
    s1 = (4*s1 + ri) % lsh1
    s2 = (3*s1 + ri + i) % lsh2
    vwi = (2*vwi + ri + i) % lvwr
    if vwi == 4: vwi = 5
make_data_join2.txt (the contents; the fixed seeds make the data reproducible):
python make_join2data.py y 1000 13 > join2_gennumA.txt
python make_join2data.py y 2000 17 > join2_gennumB.txt
python make_join2data.py y 3000 19 > join2_gennumC.txt
python make_join2data.py n 100 23 > join2_genchanA.txt
python make_join2data.py n 200 19 > join2_genchanB.txt
python make_join2data.py n 300 37 > join2_genchanC.txt
Run the script:
> sh make_data_join2.txt
This generates the six files we will use as input. (My streaming job kept failing in the map phase and I have not tracked down the error yet, so everything that follows is done with Spark.)
Remember to upload these six files to HDFS.
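For example (assuming the files sit in your current local directory and an input directory in your HDFS home, matching the paths used below):

> hdfs dfs -mkdir -p input
> hdfs dfs -put join2_gennum?.txt join2_genchan?.txt input/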
2. Using Spark's join() for the experiment
First start Hadoop (from the Hadoop directory):
./sbin/start-all.sh
Then start Spark (from the Spark directory):
./sbin/start-all.sh
Then launch pyspark:
./bin/pyspark
The assignment asks for the total viewer count for channel BAT.
First, read the text files from HDFS (adjust the path to your own setup):
show_views_file = sc.textFile("input/join2_gennum?.txt")
You can check what was read:
show_views_file.take(2)
will return the first 2 elements of the dataset:
[u'Hourly_Sports,21', u'PostModern_Show,38']
Next, split each line into show and view count:
def split_show_views(line):
    key_value = line.split(",")
    show = key_value[0]
    views = key_value[1]
    return (show, views)
Then:
show_views = show_views_file.map(split_show_views)
You can inspect the result with show_views.collect(), which returns
(u'show', u'views'), ...
· Read the channel data:
show_channel_file = sc.textFile("input/join2_genchan?.txt")
Split each line into show and channel (the function gets its own name so it does not shadow the one above):
def split_show_channel(line):
    key_value = line.split(",")
    show = key_value[0]
    channel = key_value[1]
    return (show, channel)
>> show_channel = show_channel_file.map(split_show_channel)
· Join the two datasets show_views and show_channel with join():
joined_dataset = show_channel.join(show_views)
Afterwards you can inspect it with .collect(); the output looks like
(u'show', (u'channel', u'views')), ...
Note the structure here: the show sits at position [0] and the (channel, views) pair at position [1], so the channel is at [1][0] and the view count at [1][1].
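To make that indexing concrete, here is one joined record (made-up values) and how each piece is reached:

record = (u'Hourly_Sports', (u'BAT', u'21'))
record[0]     # u'Hourly_Sports' -- the show
record[1]     # (u'BAT', u'21')  -- the (channel, views) pair
record[1][0]  # u'BAT'           -- the channel
record[1][1]  # u'21'            -- the view count (still a string)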
· Sum the viewers for channel BAT
The idea is to extract the channel as the key, i.e. pull out the channel at [1][0] and the views at [1][1]:
def extract_channel_views(show_views_channel):
    key_value = show_views_channel  # (show, (channel, views))
    channel = key_value[1][0]
    views = int(key_value[1][1])
    return (channel, views)
>> channel_views = joined_dataset.map(extract_channel_views)
def sum_channel(a, b):
    return a + b
>> channel_views.reduceByKey(sum_channel).collect()
After this you can see the total viewer count for every channel.
Output:
[(u'BOB', 2591062), (u'NOX', 2583583), (u'CAB', 3940862), (u'ABC', 1115974), (u'MAN', 6566187), (u'BAT', 5099141), (u'XYZ', 5208016), (u'DEF', 8032799), (u'CNO', 3941177)]
From this you can read off the total number of viewers for channel BAT.
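(As an aside, reduceByKey(lambda a, b: a + b) would do the same job without defining sum_channel.)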
Assignment 2: find the viewer count for each show on channel ABC
The course originally wants you to write a Python mapper.py and reducer.py and produce the result via streaming, but something in my mapper must be wrong: the job starts throwing errors once the map phase reaches 33%.
Having just finished the Spark course, it struck me that this could be done with Spark too, and easily; sure enough, the result came out after a short while.
1. Reusing the earlier joined_dataset, I wrote extract_ABC_views along the same lines as extract_channel_views:
def extract_ABC_views(show_views_channel):
    key_value = show_views_channel  # (show, (channel, views))
    title = key_value[0]
    channel = key_value[1][0]
    views = int(key_value[1][1])
    if channel == "ABC":
        return (title, views)
    else:
        return (0, 0)
This solves the problem. A note on return (0, 0): when I inspected the output with collect(), the ABC records came back as expected, but the records for the other channels came back as None (the function originally had no else branch), and those None values broke the subsequent reduceByKey. So I mapped those cases to (0, 0) instead, which does not affect the result and still gives me what I want.
>>ABC_views = joined_dataset.map(extract_ABC_views)
>>ABC_views.reduceByKey(sum_channel).collect()
This gives the viewer count for each show on ABC.
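An alternative that avoids the (0, 0) placeholder altogether is to filter the joined records down to ABC before mapping; a minimal sketch under the same data layout:

ABC_views = (joined_dataset
             .filter(lambda rec: rec[1][0] == "ABC")       # keep only ABC records
             .map(lambda rec: (rec[0], int(rec[1][1]))))   # (show, views)
ABC_views.reduceByKey(sum_channel).collect()

filter() drops the non-ABC records up front, so no placeholder key ever reaches the reduce.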
More learning ahead...