Goal: a first look at the Python API of the distributed computing platform Spark.
Create RDD
Start by creating a sample RDD:
wordsList = ['cat', 'elephant', 'rat', 'rat', 'cat']
wordsRDD = sc.parallelize(wordsList, 4)
# Print out the type of wordsRDD
print type(wordsRDD)
#output: <class 'pyspark.rdd.RDD'>
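Running the snippet above requires a live SparkContext (sc), but the effect of the second argument to parallelize (the number of partitions, 4 here) can be sketched in plain Python. The helper below is hypothetical, not Spark's actual slicing code:

```python
# Rough sketch of how parallelize(data, numSlices) splits a list
# into partitions (illustrative only, not Spark's implementation).
def split_into_partitions(data, num_slices):
    n = len(data)
    return [data[n * i // num_slices: n * (i + 1) // num_slices]
            for i in range(num_slices)]

wordsList = ['cat', 'elephant', 'rat', 'rat', 'cat']
partitions = split_into_partitions(wordsList, 4)
print(partitions)  # 4 slices covering the whole list in order
```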
Map
Using map, one of the most important transformations.
With a user-defined function:
def makePlural(word):
    return word + 's'

pluralRDD = wordsRDD.map(makePlural)
print pluralRDD.collect()
#output: ['cats', 'elephants', 'rats', 'rats', 'cats']
With a lambda function:
pluralLambdaRDD = wordsRDD.map(lambda x:x+'s')
print pluralLambdaRDD.collect()
#output:['cats', 'elephants', 'rats', 'rats', 'cats']
Return the length of each word:
pluralLengths = (pluralRDD
.map(lambda x:len(x))
.collect())
print pluralLengths
#output:[4, 9, 4, 4, 4]
Transform to a pair RDD:
wordPairs = wordsRDD.map(lambda x: (x,1))
print wordPairs.collect()
#output:[('cat', 1), ('elephant', 1), ('rat', 1), ('rat', 1), ('cat', 1)]
======================== First look done; next, basic functionality ======================
Counting with pair RDDs
groupByKey() approach
- Step 1: group
wordsGrouped = wordPairs.groupByKey()
for key, value in wordsGrouped.collect():
    print '{0}: {1}'.format(key, list(value))
#output:
# rat: [1, 1]
# elephant: [1]
# cat: [1, 1]
- Step 2: count
wordCountsGrouped = wordsGrouped.mapValues(len)
print wordCountsGrouped.collect()
#output:[('rat', 2), ('elephant', 1), ('cat', 2)]
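The same group-then-count logic can be mimicked without Spark. A minimal pure-Python sketch using a dict (illustrative only, not how Spark implements groupByKey):

```python
from collections import defaultdict

wordPairs = [('cat', 1), ('elephant', 1), ('rat', 1), ('rat', 1), ('cat', 1)]

# Step 1: collect all values for each key, like groupByKey()
grouped = defaultdict(list)
for key, value in wordPairs:
    grouped[key].append(value)

# Step 2: count the values per key, like mapValues(len)
counts = {key: len(values) for key, values in grouped.items()}
print(counts)  # {'cat': 2, 'elephant': 1, 'rat': 2}
```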
reduceByKey() approach
from operator import add
wordCounts = wordPairs.reduceByKey(add)
print wordCounts.collect()
#output: [('rat', 2), ('elephant', 1), ('cat', 2)]
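Unlike groupByKey, reduceByKey combines values per key inside each partition before shuffling, which usually moves much less data. A plain-Python sketch of that two-phase reduction (the partitioning below is assumed for illustration, not Spark internals):

```python
from operator import add

# Assume the pairs landed in two partitions
partitions = [[('cat', 1), ('elephant', 1)],
              [('rat', 1), ('rat', 1), ('cat', 1)]]

def reduce_by_key(pairs, func):
    # Fold each key's values together with func
    acc = {}
    for key, value in pairs:
        acc[key] = func(acc[key], value) if key in acc else value
    return list(acc.items())

# Phase 1: combine locally within each partition (map-side combine)
local = [reduce_by_key(part, add) for part in partitions]
# Phase 2: merge the partial results across partitions
merged = reduce_by_key([pair for part in local for pair in part], add)
print(sorted(merged))  # [('cat', 2), ('elephant', 1), ('rat', 2)]
```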
Pre-processing the file
Goal: remove all
- Capitalization
- Punctuation
- Leading or trailing spaces
import re
def removePunctuation(text):
    # Collapse runs of whitespace and strip leading/trailing spaces
    text = ' '.join(text.split())
    # Keep only letters, digits, and whitespace (drops punctuation);
    # note the range must be 0-9, not 1-9, or the digit 0 is lost
    a = re.findall(r'[A-Za-z0-9\s]+', text)
    b = ''.join(a)
    return b.lower()
print removePunctuation('Hi, you!')
print removePunctuation(' No under_score!')
#output:hi you
#no underscore
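The same cleanup can be done in one pass with re.sub instead of findall plus join. A sketch that keeps the behavior of removePunctuation above (the snake_case name is just to avoid clashing with the original):

```python
import re

def remove_punctuation(text):
    # Drop every character that is not a letter, digit, or whitespace,
    # then collapse/trim whitespace and lowercase the result
    cleaned = re.sub(r'[^A-Za-z0-9\s]', '', text)
    return ' '.join(cleaned.split()).lower()

print(remove_punctuation('Hi, you!'))          # hi you
print(remove_punctuation(' No under_score!'))  # no underscore
```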