Basic Spark transformation and action operations
Function examples:
map function (applies a function to every element of the RDD):
Dataset used:
In [1]: rdd = sc.textFile("file:/home/training/training_materials/data/weblogs/2014-03-15.log")
In [3]: rdd.map(lambda x: x.split()).take(2)
Out[3]:
[[u'234.206.18.239',
u'-',
u'8495',
u'[15/Mar/2014:23:59:30',
u'+0100]',
u'"GET',
u'/KBDOC-00082.html',
u'HTTP/1.0"',
u'200',
u'9054',
u'"http://www.loudacre.com"',
u'"Loudacre',
u'Mobile',
u'Browser',
u'Titanic',
u'2200"'],
[u'234.206.18.239',
u'-',
u'8495',
u'[15/Mar/2014:23:59:30',
u'+0100]',
u'"GET',
u'/theme.css',
u'HTTP/1.0"',
u'200',
u'4552',
u'"http://www.loudacre.com"',
u'"Loudacre',
u'Mobile',
u'Browser',
u'Titanic',
u'2200"']]
In [4]: rdd.map(lambda x: x.split()).map(lambda field: (field[0],field[2])).take(2)
Out[4]: [(u'234.206.18.239', u'8495'), (u'234.206.18.239', u'8495')]
In [5]: rdd.map(lambda x: x.split()).map(lambda field: (field[0]+"/"+field[2])).take(2)
Out[5]: [u'234.206.18.239/8495', u'234.206.18.239/8495']
In [8]: rdd.map(lambda x: len(x)).take(3)
Out[8]: [158, 151, 167]
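The split-then-select pattern used in In [3] and In [4] can be checked without a cluster. The following is a minimal plain-Python sketch, not Spark code: the built-in map stands in for rdd.map, the helper name pick_fields is hypothetical, and the sample line is rebuilt from the fields shown in Out[3].

```python
# Plain-Python model of rdd.map(split).map(select); no Spark required.
line = ('234.206.18.239 - 8495 [15/Mar/2014:23:59:30 +0100] '
        '"GET /KBDOC-00082.html HTTP/1.0" 200 9054 '
        '"http://www.loudacre.com" "Loudacre Mobile Browser Titanic 2200"')

def pick_fields(log_line):
    """Split on whitespace and keep the first and third fields."""
    fields = log_line.split()
    return (fields[0], fields[2])

lines = [line]                         # stands in for the RDD of log lines
pairs = list(map(pick_fields, lines))  # stands in for the two chained map calls
print(pairs)
```

Doing the split and the selection in one function avoids the second map call; in Spark either form should give the same result.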
keyBy function (pairs each element with a key computed from it):
In [26]: rdd.keyBy(lambda x: x.split()[2]).take(2)
Out[26]:
[(u'8495',
u'234.206.18.239 - 8495 [15/Mar/2014:23:59:30 +0100] "GET /KBDOC-00082.html HTTP/1.0" 200 9054 "http://www.loudacre.com" "Loudacre Mobile Browser Titanic 2200"'),
(u'8495',
u'234.206.18.239 - 8495 [15/Mar/2014:23:59:30 +0100] "GET /theme.css HTTP/1.0" 200 4552 "http://www.loudacre.com" "Loudacre Mobile Browser Titanic 2200"')]
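As Out[26] suggests, keyBy(f) behaves like map(lambda x: (f(x), x)): the record is kept whole and the computed key is placed in front of it. A rough plain-Python sketch of that behavior follows; the function key_by and the abbreviated log lines are made up for illustration and are not part of the Spark API or the original data set.

```python
# Plain-Python model of rdd.keyBy(f): pair each record with f(record).
def key_by(records, key_func):
    """Return (key, record) tuples, leaving each record unchanged."""
    return [(key_func(record), record) for record in records]

# Abbreviated, made-up log lines; the key is the third whitespace-separated field.
sample_lines = [
    '234.206.18.239 - 8495 "GET /KBDOC-00082.html"',
    '64.190.73.51 - 74328 "GET /titanic_2400.jpg"',
]

keyed = key_by(sample_lines, lambda line: line.split()[2])
for key, record in keyed:
    print(key, '->', record)
```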
filter function (keeps only the elements for which the function returns True):
In [7]: rdd.filter(lambda x: ".jpg" in x).take(3)
Out[7]:
[u'64.190.73.51 - 74328 [15/Mar/2014:23:49:14 +0100] "GET /titanic_2400.jpg HTTP/1.0" 200 6021 "http://www.loudacre.com" "Loudacre Mobile Browser iFruit 5"',
u'25.101.226.55 - 72619 [15/Mar/2014:23:32:32 +0100] "GET /ifruit_2.jpg HTTP/1.0" 200 18225 "http://www.loudacre.com" "Loudacre Mobile Browser Sorrento F01L"',
u'34.5.4.77 - 72680 [15/Mar/2014:23:25:12 +0100] "GET /titanic_2100.jpg HTTP/1.0" 200 8290 "http://www.loudacre.com" "Loudacre Mobile Browser Titanic 2400"']
flatMap function (maps each element to zero or more elements and flattens the result):
Dataset used:
In [9]: rdd = sc.textFile("file:/home/training/a.txt")
In [10]: rdd.flatMap(lambda x: x.split()).collect()
Out[10]: [u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'e', u'f', u'g', u'h']
In [11]: rdd.flatMap(lambda x: x.split()).map(lambda x:(x,1)).collect()
Out[11]:
[(u'a', 1),
(u'b', 1),
(u'c', 1),
(u'd', 1),
(u'e', 1),
(u'f', 1),
(u'g', 1),
(u'h', 1),
(u'e', 1),
(u'f', 1),
(u'g', 1),
(u'h', 1)]
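The difference between map and flatMap is easiest to see side by side. The sketch below uses plain Python rather than Spark, and the contents of a.txt are an assumption: the source only shows the flattened output, so the three lines here are a guess that reproduces it.

```python
from itertools import chain

# Assumed contents of a.txt (hypothetical; only the output appears in the source).
lines = ["a b c d", "e f g h", "e f g h"]

# map: one output element per input line, so the result is a list of lists.
mapped = [line.split() for line in lines]

# flatMap: the per-line lists are concatenated into one flat sequence.
flat_mapped = list(chain.from_iterable(line.split() for line in lines))

# Pairing each word with 1, as in In [11], prepares the data for counting.
word_pairs = [(word, 1) for word in flat_mapped]

print(mapped)
print(flat_mapped)
```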
reduceByKey function (merges the values of each key with the given function):
In [12]: rdd.flatMap(lambda x: x.split()).map(lambda x:(x,1)).reduceByKey(lambda x,y:(x+y)).collect()
Out[12]:
[(u'a', 1),
(u'c', 1),
(u'b', 1),
(u'e', 2),
(u'd', 1),
(u'g', 2),
(u'f', 2),
(u'h', 2)]
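What reduceByKey computes can be modeled with a dictionary fold. This is only a single-machine sketch in plain Python; reduce_by_key is a hypothetical helper, not the Spark implementation. Spark merges values per partition and then across partitions, so the function passed in should be associative and commutative, and the order of keys in the collected result (as in Out[12]) is not guaranteed.

```python
def reduce_by_key(pairs, func):
    """Fold the values of each key together with func; returns a dict."""
    merged = {}
    for key, value in pairs:
        if key in merged:
            merged[key] = func(merged[key], value)
        else:
            merged[key] = value
    return merged

# The words from Out[10], paired with 1 as in In [11].
words = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'e', 'f', 'g', 'h']
pairs = [(word, 1) for word in words]

counts = reduce_by_key(pairs, lambda x, y: x + y)
print(sorted(counts.items()))
```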
upper() (a Python string method, applied to each line through map):
Dataset used:
In [15]: rdd = sc.textFile("file:/home/training/c.txt")
In [16]: rdd.map(lambda x: x.upper()).collect()
Out[16]:
[u"I'VE NEVER SEEN A PURPLE COW.",
u'I NEVER HOPE TO SEE ONE;',
u'BUT I CAN TELL YOU, ANYHOW,',
u"I'D RATHER SEE THAN BE ONE."]
startswith('I') (a Python string method, used as the filter predicate):
In [20]: rdd.filter(lambda x: x.startswith('I')).collect()
Out[20]:
[u"I've never seen a purple cow.",
u'I never hope to see one;',
u"I'd rather see than be one."]
parallelize(collection) function (creates an RDD from a local Python collection):
In [21]: myData = ["Alice","Carlos","Frank","Barbara"]
In [22]: myRdd = sc.parallelize(myData)
In [23]: myRdd.take(2)
Out[23]: ['Alice', 'Carlos']
union function (concatenates two RDDs and keeps duplicates):
In [32]: rdd1 = sc.parallelize(['Chicago','Boston','Paris','San Francisco','Tokyo'])
In [33]: rdd2 = sc.parallelize(['San Francisco','Boston','Amsterdam','Mumbai','McMurdo Station'])
In [35]: rdd1.union(rdd2).collect()
Out[35]:
['Chicago',
'Boston',
'Paris',
'San Francisco',
'Tokyo',
'San Francisco',
'Boston',
'Amsterdam',
'Mumbai',
'McMurdo Station']
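Out[35] contains 'San Francisco' and 'Boston' twice: union concatenates and does not deduplicate. The plain-Python sketch below models that with list concatenation and shows one way to drop the repeats afterwards. In Spark, chaining distinct() after union would be the usual way to deduplicate, though the order of its result is not guaranteed, unlike the order-preserving approach assumed here.

```python
rdd1_data = ['Chicago', 'Boston', 'Paris', 'San Francisco', 'Tokyo']
rdd2_data = ['San Francisco', 'Boston', 'Amsterdam', 'Mumbai', 'McMurdo Station']

# Model of rdd1.union(rdd2): plain concatenation, duplicates kept.
combined = rdd1_data + rdd2_data

# Order-preserving deduplication (dict keys keep insertion order in Python 3.7+).
unique = list(dict.fromkeys(combined))

print(len(combined), len(unique))
```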
zip function (pairs the elements of two RDDs by position):
In [37]: rdd1.zip(rdd2).collect()
Out[37]:
[('Chicago', 'San Francisco'),
('Boston', 'Boston'),
('Paris', 'Amsterdam'),
('San Francisco', 'Mumbai'),
('Tokyo', 'McMurdo Station')]
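The example works because both RDDs hold five elements. RDD.zip assumes the two RDDs have the same number of partitions and the same number of elements in each partition, whereas Python's built-in zip silently truncates to the shorter input. The sketch below is a plain-Python stand-in, with a hypothetical strict_zip helper that raises on a length mismatch to mimic the stricter expectation; it does not model partitions.

```python
def strict_zip(left, right):
    """Pair elements by position; refuse inputs of different lengths."""
    if len(left) != len(right):
        raise ValueError("inputs must have the same number of elements")
    return list(zip(left, right))

cities1 = ['Chicago', 'Boston', 'Paris', 'San Francisco', 'Tokyo']
cities2 = ['San Francisco', 'Boston', 'Amsterdam', 'Mumbai', 'McMurdo Station']

zipped = strict_zip(cities1, cities2)
print(zipped)

# A mismatch is reported instead of being silently truncated.
try:
    strict_zip(cities1, cities2[:3])
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```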