Spark+Python lab2

Goal: a first look at using the Python API of Spark, a distributed computing platform.

Extra Tutorial

Create RDD

Start by creating a sample RDD.

wordsList = ['cat', 'elephant', 'rat', 'rat', 'cat']
wordsRDD = sc.parallelize(wordsList, 4)
# Print out the type of wordsRDD
print(type(wordsRDD))
#output: <class 'pyspark.rdd.RDD'>
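The second argument to sc.parallelize is the number of partitions the list is split into. As a rough local illustration (plain Python, no Spark needed), the hypothetical helper split_into_slices below cuts a list into contiguous slices. It is only a model: how Spark actually assigns elements to partitions is an implementation detail and may differ.

```python
def split_into_slices(items, num_slices):
    """Cut items into num_slices contiguous chunks (illustrative model only)."""
    n = len(items)
    slices = []
    for i in range(num_slices):
        start = i * n // num_slices
        end = (i + 1) * n // num_slices
        slices.append(items[start:end])
    return slices

words_list = ['cat', 'elephant', 'rat', 'rat', 'cat']
partitions = split_into_slices(words_list, 4)
print(len(partitions))   # number of slices requested
print(partitions)
```

On a real RDD, wordsRDD.getNumPartitions() and wordsRDD.glom().collect() show the actual partition count and layout.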

Map

Using map, one of the key transformations.

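Custom function:
# assumed definition of makePlural, inferred from the output below
def makePlural(word):
    return word + 's'
pluralRDD = wordsRDD.map(makePlural)
print(pluralRDD.collect())
#output: ['cats', 'elephants', 'rats', 'rats', 'cats']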
Lambda function:
pluralLambdaRDD = wordsRDD.map(lambda x: x + 's')
print(pluralLambdaRDD.collect())
#output: ['cats', 'elephants', 'rats', 'rats', 'cats']
Word lengths:
pluralLengths = (pluralRDD
                 .map(lambda x: len(x))
                 .collect())
print(pluralLengths)
#output: [4, 9, 4, 4, 4]
Transform to a pair RDD:
wordPairs = wordsRDD.map(lambda x: (x, 1))
print(wordPairs.collect())
#output: [('cat', 1), ('elephant', 1), ('rat', 1), ('rat', 1), ('cat', 1)]
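For intuition, map applies a function to every element independently and returns a new collection of the same length. The sketch below reproduces the results above on a plain Python list, with no Spark involved; make_plural is a stand-in for the makePlural function used above.

```python
words = ['cat', 'elephant', 'rat', 'rat', 'cat']

def make_plural(word):
    """Append 's' to a word (stand-in for makePlural)."""
    return word + 's'

plurals = [make_plural(w) for w in words]   # like wordsRDD.map(makePlural)
lengths = [len(w) for w in plurals]         # like pluralRDD.map(lambda x: len(x))
pairs = [(w, 1) for w in words]             # like wordsRDD.map(lambda x: (x, 1))

print(plurals)
print(lengths)
print(pairs)
```

One difference from the list version: an RDD map is lazy, so nothing is computed until an action such as collect() runs.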



======================== End of the warm-up; basic functionality comes next ======================



Counting with pair RDDs

groupByKey() approach
  • Step 1: group
wordsGrouped = wordPairs.groupByKey()
for key, value in wordsGrouped.collect():
    print('{0}: {1}'.format(key, list(value)))
#output:
#   rat: [1, 1]
#   elephant: [1]
#   cat: [1, 1]
  • Step 2: count
wordCountsGrouped = wordsGrouped.mapValues(len)
print(wordCountsGrouped.collect())
#output:[('rat', 2), ('elephant', 1), ('cat', 2)]
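The two steps above can be modelled locally: first gather every value under its key, then take the length of each group. This is a plain-Python sketch of the idea, not how Spark implements groupByKey.

```python
from collections import defaultdict

pairs = [('cat', 1), ('elephant', 1), ('rat', 1), ('rat', 1), ('cat', 1)]

# Step 1: group all values that share a key
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)

# Step 2: count the values in each group (like mapValues(len))
counts = {key: len(values) for key, values in grouped.items()}

print(dict(grouped))
print(counts)
```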
reduceByKey() approach
from operator import add
wordCounts = wordPairs.reduceByKey(add)
print(wordCounts.collect())
#output: [('rat', 2), ('elephant', 1), ('cat', 2)]
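reduceByKey gives the same counts as the groupByKey approach, but it merges the values for each key inside every partition before the shuffle, so fewer records cross the network. A minimal local sketch, assuming a hypothetical two-partition layout of the five pairs (the helper combine is illustrative, not a Spark API):

```python
from operator import add

# Hypothetical layout: the five pairs spread over two partitions
partitions = [
    [('cat', 1), ('elephant', 1)],
    [('rat', 1), ('rat', 1), ('cat', 1)],
]

def combine(pairs, func):
    """Merge values that share a key using func."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

# Step 1: combine inside each partition, before any shuffle
local = [combine(part, add) for part in partitions]
# Step 2: merge the per-partition results across partitions
shuffled = [item for part in local for item in part.items()]
word_counts = combine(shuffled, add)

print(local)
print(len(shuffled))   # records that would be shuffled, versus 5 raw pairs
print(word_counts)
```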

Pre-processing the File

Goal: remove all

  • Capitalization
  • Punctuation
  • Leading or trailing spaces
import re
def removePunctuation(text):
    # Collapse runs of whitespace and drop leading/trailing spaces
    text = ' '.join(text.split())
    # Keep only letters, digits and whitespace, i.e. drop all punctuation
    a = re.findall(r'[A-Za-z0-9\s]+', text)
    b = ''.join(a)
    return b.lower()

print(removePunctuation('Hi, you!'))
print(removePunctuation(' No under_score!'))
#output: hi you
#        no underscore
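A natural next step is to clean each line and then count the words. The sketch below does this locally with collections.Counter on two made-up lines. The helper remove_punctuation follows the same approach as the function above; on an RDD the same steps would typically use map, flatMap and reduceByKey.

```python
import re
from collections import Counter

def remove_punctuation(text):
    """Lower-case text, strip punctuation and surrounding whitespace."""
    text = ' '.join(text.split())
    kept = re.findall(r'[A-Za-z0-9\s]+', text)
    return ''.join(kept).lower()

lines = ['  The cat, the RAT!', 'An elephant; a cat.']   # made-up sample input
cleaned = [remove_punctuation(line) for line in lines]
words = [w for line in cleaned for w in line.split()]
counts = Counter(words)

print(cleaned)
print(counts)
```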