Basic Spark transformation and action operations




Function examples:

map function:

Input dataset:


In [1]: rdd = sc.textFile("file:/home/training/training_materials/data/weblogs/2014-03-15.log")

In [3]: rdd.map(lambda x: x.split()).take(2)
Out[3]: 
[[u'234.206.18.239',
  u'-',
  u'8495',
  u'[15/Mar/2014:23:59:30',
  u'+0100]',
  u'"GET',
  u'/KBDOC-00082.html',
  u'HTTP/1.0"',
  u'200',
  u'9054',
  u'"http://www.loudacre.com"',
  u'"Loudacre',
  u'Mobile',
  u'Browser',
  u'Titanic',
  u'2200"'],
 [u'234.206.18.239',
  u'-',
  u'8495',
  u'[15/Mar/2014:23:59:30',
  u'+0100]',
  u'"GET',
  u'/theme.css',
  u'HTTP/1.0"',
  u'200',
  u'4552',
  u'"http://www.loudacre.com"',
  u'"Loudacre',
  u'Mobile',
  u'Browser',
  u'Titanic',
  u'2200"']]



In [4]: rdd.map(lambda x: x.split()).map(lambda field: (field[0], field[2])).take(2)
Out[4]: [(u'234.206.18.239', u'8495'), (u'234.206.18.239', u'8495')]


In [5]: rdd.map(lambda x: x.split()).map(lambda field: field[0] + "/" + field[2]).take(2)
Out[5]: [u'234.206.18.239/8495', u'234.206.18.239/8495']

In [8]: rdd.map(lambda x: len(x)).take(3)
Out[8]: [158, 151, 167]
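The field extraction in In [4] can be checked without a cluster. The following is a minimal plain-Python sketch (not Spark itself) of what the two chained map calls do to a single record. The helper name `extract_fields` is hypothetical, and the sample line is the first record shown above.

```python
# Plain-Python sketch of the per-record work done by the two map() calls.
# `extract_fields` is a hypothetical helper name, not part of the Spark API.
def extract_fields(line):
    fields = line.split()          # split on whitespace, as in map(lambda x: x.split())
    return (fields[0], fields[2])  # keep the 1st and 3rd fields of the record

sample = ('234.206.18.239 - 8495 [15/Mar/2014:23:59:30 +0100] '
          '"GET /KBDOC-00082.html HTTP/1.0" 200 9054 '
          '"http://www.loudacre.com"  "Loudacre Mobile Browser Titanic 2200"')

pair = extract_fields(sample)
print(pair)
```

map always produces exactly one output element per input element, so the result has as many records as the input.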


keyBy function:

In [26]: rdd.keyBy(lambda x: x.split()[2]).take(2)
Out[26]: 
[(u'8495',
  u'234.206.18.239 - 8495 [15/Mar/2014:23:59:30 +0100] "GET /KBDOC-00082.html HTTP/1.0" 200 9054 "http://www.loudacre.com"  "Loudacre Mobile Browser Titanic 2200"'),
 (u'8495',
  u'234.206.18.239 - 8495 [15/Mar/2014:23:59:30 +0100] "GET /theme.css HTTP/1.0" 200 4552 "http://www.loudacre.com"  "Loudacre Mobile Browser Titanic 2200"')]
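As the output above suggests, keyBy(f) keeps each element unchanged and pairs it with the key f(element), which is the same as map(lambda x: (f(x), x)). A rough plain-Python sketch of that behaviour follows; `key_by` is a hypothetical helper, and the two log lines are shortened copies of the records above.

```python
# Plain-Python sketch: keyBy(f) turns each element x into the pair (f(x), x).
# `key_by` is a hypothetical helper, not a Spark function.
def key_by(func, items):
    return [(func(item), item) for item in items]

# Shortened versions of the two records shown above (trailing fields dropped).
lines = [
    '234.206.18.239 - 8495 [15/Mar/2014:23:59:30 +0100] "GET /KBDOC-00082.html HTTP/1.0" 200 9054',
    '234.206.18.239 - 8495 [15/Mar/2014:23:59:30 +0100] "GET /theme.css HTTP/1.0" 200 4552',
]

keyed = key_by(lambda x: x.split()[2], lines)
for key, value in keyed:
    print(key, '->', value)
```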

filter function:

In [7]: rdd.filter(lambda x: ".jpg" in x).take(3)
Out[7]: 
[u'64.190.73.51 - 74328 [15/Mar/2014:23:49:14 +0100] "GET /titanic_2400.jpg HTTP/1.0" 200 6021 "http://www.loudacre.com"  "Loudacre Mobile Browser iFruit 5"',
 u'25.101.226.55 - 72619 [15/Mar/2014:23:32:32 +0100] "GET /ifruit_2.jpg HTTP/1.0" 200 18225 "http://www.loudacre.com"  "Loudacre Mobile Browser Sorrento F01L"',
 u'34.5.4.77 - 72680 [15/Mar/2014:23:25:12 +0100] "GET /titanic_2100.jpg HTTP/1.0" 200 8290 "http://www.loudacre.com"  "Loudacre Mobile Browser Titanic 2400"']


flatMap function:

Input dataset:


In [9]: rdd = sc.textFile("file:/home/training/a.txt")


In [10]: rdd.flatMap(lambda x: x.split()).collect()
Out[10]: [u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'e', u'f', u'g', u'h']


In [11]: rdd.flatMap(lambda x: x.split()).map(lambda x:(x,1)).collect()

Out[11]: 
[(u'a', 1),
 (u'b', 1),
 (u'c', 1),
 (u'd', 1),
 (u'e', 1),
 (u'f', 1),
 (u'g', 1),
 (u'h', 1),
 (u'e', 1),
 (u'f', 1),
 (u'g', 1),
 (u'h', 1)]
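The difference between map and flatMap is easiest to see side by side. Here is a minimal plain-Python sketch; the contents of a.txt are not shown in this post, so the three lines below are an assumption chosen to reproduce Out[10].

```python
# Assumed contents of a.txt (hypothetical; the post does not show the file).
lines = ["a b c d", "e f g h", "e f g h"]

# map: one output per input line, so the result is a list of lists.
mapped = [line.split() for line in lines]

# flatMap: the per-line lists are flattened into a single list of words.
flat_mapped = [word for line in lines for word in line.split()]

print(mapped)
print(flat_mapped)
```

With map, the result has one entry per line; with flatMap, it has one entry per word.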


reduceByKey function:

In [12]: rdd.flatMap(lambda x: x.split()).map(lambda x:(x,1)).reduceByKey(lambda x,y:(x+y)).collect()
Out[12]: 
[(u'a', 1),
 (u'c', 1),
 (u'b', 1),
 (u'e', 2),
 (u'd', 1),
 (u'g', 2),
 (u'f', 2),
 (u'h', 2)]
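reduceByKey merges all values that share a key using the supplied two-argument function. A rough plain-Python sketch of the same word count follows; `reduce_by_key` is a hypothetical helper, and unlike Spark it runs on a single machine and returns a dict. In Spark the order of the resulting pairs is not guaranteed, which is why Out[12] is not alphabetical.

```python
# Plain-Python sketch of reduceByKey semantics.
# `reduce_by_key` is a hypothetical helper, not a Spark function.
def reduce_by_key(func, pairs):
    result = {}
    for key, value in pairs:
        if key in result:
            result[key] = func(result[key], value)  # merge with the running value
        else:
            result[key] = value                     # first value seen for this key
    return result

# The words produced by the flatMap step in Out[10].
words = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'e', 'f', 'g', 'h']

counts = reduce_by_key(lambda x, y: x + y, [(w, 1) for w in words])
print(counts)
```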


upper():

Input dataset:


In [15]: rdd = sc.textFile("file:/home/training/c.txt")


In [16]: rdd.map(lambda x: x.upper()).collect()
Out[16]: 
[u"I'VE NEVER SEEN A PURPLE COW.",
 u'I NEVER HOPE TO SEE ONE;',
 u'BUT I CAN TELL YOU, ANYHOW,',
 u"I'D RATHER SEE THAN BE ONE."]

startswith('I'):

In [20]: rdd.filter(lambda x: x.startswith('I')).collect()
Out[20]: 
[u"I've never seen a purple cow.",
 u'I never hope to see one;',
 u"I'd rather see than be one."]
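The predicate passed to filter is an ordinary Python function that returns True or False for each record. Below is a small plain-Python sketch of the same selection; the capitalisation of the third line is an assumption, since the post only shows it in upper case.

```python
# The four lines of c.txt; the original case of the third line is assumed.
poem = [
    "I've never seen a purple cow.",
    "I never hope to see one;",
    "But I can tell you, anyhow,",
    "I'd rather see than be one.",
]

# filter keeps only the records for which the predicate is true.
starts_with_i = [line for line in poem if line.startswith('I')]

for line in starts_with_i:
    print(line)
```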


parallelize(collection) function:

In [21]: myData = ["Alice","Carlos","Frank","Barbara"]

In [22]: myRdd = sc.parallelize(myData)

In [23]: myRdd.take(2)
Out[23]: ['Alice', 'Carlos']


union function:


In [32]: rdd1 = sc.parallelize(['Chicago','Boston','Paris','San Francisco','Tokyo'])

In [33]: rdd2 = sc.parallelize(['San Francisco','Boston','Amsterdam','Mumbai','McMurdo Station'])

In [35]: rdd1.union(rdd2).collect()

Out[35]: 
['Chicago',
 'Boston',
 'Paris',
 'San Francisco',
 'Tokyo',
 'San Francisco',
 'Boston',
 'Amsterdam',
 'Mumbai',
 'McMurdo Station']
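Note that 'Boston' and 'San Francisco' each appear twice in Out[35]: union concatenates the two datasets and does not remove duplicates. The plain-Python sketch below mimics that with list concatenation. In Spark itself, chaining distinct() after union would be the usual way to drop the repeats, though the order of the result is then not guaranteed.

```python
# Plain-Python sketch: union behaves like concatenation, keeping duplicates.
cities1 = ['Chicago', 'Boston', 'Paris', 'San Francisco', 'Tokyo']
cities2 = ['San Francisco', 'Boston', 'Amsterdam', 'Mumbai', 'McMurdo Station']

combined = cities1 + cities2

# De-duplicate while preserving first-seen order (a dict keeps insertion order).
unique = list(dict.fromkeys(combined))

print(len(combined), len(unique))
```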

zip function:

In [37]: rdd1.zip(rdd2).collect()
Out[37]: 
[('Chicago', 'San Francisco'),
 ('Boston', 'Boston'),
 ('Paris', 'Amsterdam'),
 ('San Francisco', 'Mumbai'),
 ('Tokyo', 'McMurdo Station')]
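zip pairs elements by position. According to the Spark documentation, RDD.zip assumes both RDDs have the same number of partitions and the same number of elements in each partition, which holds here because both were created the same way from five-element lists. The positional pairing itself can be sketched in plain Python with the built-in zip:

```python
# Plain-Python sketch of positional pairing; this uses Python's built-in zip,
# not Spark's RDD.zip.
cities1 = ['Chicago', 'Boston', 'Paris', 'San Francisco', 'Tokyo']
cities2 = ['San Francisco', 'Boston', 'Amsterdam', 'Mumbai', 'McMurdo Station']

pairs = list(zip(cities1, cities2))

for left, right in pairs:
    print(left, '<->', right)
```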


