Spark Pair RDD Key/Value Operations
1 Introduction to Pair RDDs
2 Creating Pair RDDs
3 Transformations on Pair RDDs
3.1 Aggregations
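1 Introduction to Pair RDDs
Spark provides a set of special operations on RDDs that contain key/value pairs. These RDDs are called pair RDDs. Pair RDDs are a building block in many programs because they expose operations that act on each key in parallel or regroup data across nodes. For example, pair RDDs have a reduceByKey() method that aggregates the data for each key separately, and a join() method that merges two RDDs by combining the elements that share a key. It is common to extract some fields from an RDD (for example an event time, a user ID, or some other identifier) and use those fields as the keys in pair RDD operations.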
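2 Creating Pair RDDs
There are several ways to create a pair RDD in Spark. Many formats that store key/value data return a pair RDD directly when they are loaded. When you have a regular RDD that you want to turn into a pair RDD, call map() with a function that returns key/value pairs. The transcript that follows turns an RDD of text lines into a pair RDD keyed by the first field of each line (the text before the first colon in a passwd file).
In Scala, the function must return a two-element tuple for the key/value operations to be available. An implicit conversion (to PairRDDFunctions) then adds the extra key/value functions to any RDD of tuples.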
scala> var lines = sc.textFile("passwd")
lines: org.apache.spark.rdd.RDD[String] = passwd MapPartitionsRDD[1] at textFile at <console>:27

scala> var pair = lines.map(x => (x.split(":")(0), x))
pair: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[2] at map at <console>:29

scala> pair.take(3).foreach(println)
(root,root:x:0:0:root:/root:/bin/bash)
(bin,bin:x:1:1:bin:/bin:/sbin/nologin)
(daemon,daemon:x:2:2:daemon:/sbin:/sbin/nologin)
To create a pair RDD from an in-memory collection in Scala, call SparkContext.parallelize() on a collection of two-element tuples.
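A minimal sketch of the in-memory case, assuming Spark is on the classpath. The sample tuples and the names `shells`, `pairs`, and `users` are illustrative and not from the original; `SparkContext.getOrCreate` is used so that the snippet reuses the context of a running spark-shell or starts a local one otherwise.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: reuse the shell's context if one exists, else start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("pair-rdd-sketch").setMaster("local[*]"))

// A local collection of (key, value) tuples; the data is made up for illustration.
val shells = Seq(("root", "/bin/bash"), ("bin", "/sbin/nologin"), ("daemon", "/sbin/nologin"))

// parallelize() turns the collection into an RDD[(String, String)].
val pairs = sc.parallelize(shells)

// Key/value operations such as keys are available because the elements are tuples.
val users = pairs.keys.collect().toList
println(users) // List(root, bin, daemon)
```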
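3 Transformations on Pair RDDs
Pair RDDs can use all the transformations available on standard RDDs. Because pair RDDs contain tuples, the functions you pass must operate on tuples rather than on individual elements.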
Transformations on one pair RDD (example: {(1, 2), (3, 4), (3, 6)})
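As a rough illustration of the commonly used single-RDD transformations on this example data, the sketch below applies each one and collects the result. The selection of operations and all variable names are my own choice, not a list taken from the original; the context setup is the same assumption as in a spark-shell session.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: reuse the shell's context if one exists, else start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("pair-rdd-sketch").setMaster("local[*]"))

val rdd = sc.parallelize(Seq((1, 2), (3, 4), (3, 6)))

// Combine values that share a key.
val summed = rdd.reduceByKey(_ + _).collect().toMap              // Map(1 -> 2, 3 -> 10)

// Group values that share a key (sorted here only to make the result stable).
val grouped = rdd.groupByKey().mapValues(_.toList.sorted).collect().toMap

// Apply a function to each value without touching the key.
val bumped = rdd.mapValues(_ + 1).collect().toList               // List((1,3), (3,5), (3,7))

// Produce zero or more values per input value, keeping the key.
val expanded = rdd.flatMapValues(x => x to 5).collect().toList

// Keys only, values only.
val ks = rdd.keys.collect().toList                               // List(1, 3, 3)
val vs = rdd.values.collect().toList                             // List(2, 4, 6)

// Sort by key, descending.
val sortedKeys = rdd.sortByKey(ascending = false).keys.collect().toList // List(3, 3, 1)
```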
Transformations on two pair RDDs (rdd = {(1, 2), (3, 4), (3, 6)}, other = {(3, 9)})
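A comparable sketch for operations that take two pair RDDs, using the `rdd` and `other` data from the caption. Which operations to show is my own selection; results are collected into sets or maps because the order of elements after a shuffle should not be relied on.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: reuse the shell's context if one exists, else start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("pair-rdd-sketch").setMaster("local[*]"))

val rdd = sc.parallelize(Seq((1, 2), (3, 4), (3, 6)))
val other = sc.parallelize(Seq((3, 9)))

// Drop elements whose key appears in the other RDD.
val subtracted = rdd.subtractByKey(other).collect().toSet   // Set((1,2))

// Inner join: only keys present in both RDDs.
val joined = rdd.join(other).collect().toSet                // Set((3,(4,9)), (3,(6,9)))

// Outer joins wrap the possibly missing side in an Option.
val left = rdd.leftOuterJoin(other).collect().toSet
val right = rdd.rightOuterJoin(other).collect().toSet

// cogroup collects, per key, all values from each RDD.
val cogrouped = rdd.cogroup(other)
  .mapValues { case (a, b) => (a.toList.sorted, b.toList) }
  .collect().toMap
```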
scala> var lines = sc.textFile("passwd")
lines: org.apache.spark.rdd.RDD[String] = passwd MapPartitionsRDD[6] at textFile at <console>:27

scala> var pair = lines.map(x => (x.split(":")(0), x.split(":")(6)))
pair: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[7] at map at <console>:29

scala> pair.take(3).foreach(println)
(root,/bin/bash)
(bin,/sbin/nologin)
(daemon,/sbin/nologin)

scala> var bash = pair.filter(u => u._2.contains("bash"))
bash: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[8] at filter at <console>:31

scala> bash.take(3).foreach(println)
(root,/bin/bash)
(hdfs,/bin/bash)
(yarn,/bin/bash)
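3.1 Aggregations
When a dataset is organized as key/value pairs, it is common to aggregate statistics across all elements that share a key. Basic RDDs offer the actions fold(), aggregate(), and reduce(); pair RDDs offer a similar set of per-key operations that combine the values sharing a key. These operations return RDDs, so they are transformations rather than actions.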
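reduceByKey() is quite similar to reduce(): both take a function and use it to combine values. reduceByKey() runs one reduce operation per key in the dataset, in parallel, and each of them combines the values that have the same key. Because a dataset can have a very large number of keys, reduceByKey() is not implemented as an action that returns a value to the user program. Instead, it returns a new RDD made up of each key and the reduced value for that key.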
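To make the contrast concrete, here is a small sketch that runs reduce() and reduceByKey() on the same data. The sample tuples and variable names are illustrative; the context setup mirrors what a spark-shell session already provides.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: reuse the shell's context if one exists, else start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("pair-rdd-sketch").setMaster("local[*]"))

val rdd = sc.parallelize(
  List(("panda", 0), ("pink", 3), ("pirate", 3), ("panda", 1), ("pink", 4)))

// reduce() is an action: it returns a single value to the driver.
val total: Int = rdd.values.reduce(_ + _)   // 11

// reduceByKey() is a transformation: it returns another RDD, one element per key.
val perKey = rdd.reduceByKey(_ + _)

// Only an action such as collect() brings the per-key results to the driver.
val sums = perKey.collect().toMap           // Map(panda -> 1, pink -> 7, pirate -> 3)
```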
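foldByKey() is quite similar to fold(): both take a zero value of the same type as the data in the RDD, together with a combination function. As with fold(), the zero value passed to foldByKey() must be neutral: combining it with another element through the combination function must return that element unchanged.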
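A short sketch of foldByKey() with two different combination functions, each paired with a zero value that is neutral for it (0 for addition, Int.MinValue for max). The data and names are illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: reuse the shell's context if one exists, else start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("pair-rdd-sketch").setMaster("local[*]"))

val rdd = sc.parallelize(
  List(("panda", 0), ("pink", 3), ("pirate", 3), ("panda", 1), ("pink", 4)))

// 0 is neutral for addition, so the result matches reduceByKey(_ + _).
val sums = rdd.foldByKey(0)(_ + _).collect().toMap

// Int.MinValue is neutral for max: max(Int.MinValue, x) == x.
val maxes = rdd.foldByKey(Int.MinValue)((a, b) => math.max(a, b)).collect().toMap
```

A zero value that is not neutral (for example 1 with addition) would skew the result, and by an amount that depends on how the data is partitioned, since the zero value can be applied more than once per key.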
You can use reduceByKey() together with mapValues() to compute the average value for each key:
scala> val rdd = sc.parallelize(List(("panda", 0), ("pink", 3), ("pirate", 3), ("panda", 1), ("pink", 4)))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[9] at parallelize at <console>:27

scala> val kv = rdd.mapValues(x => (x, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)).map(v => (v._1, v._2._1 / v._2._2.toDouble))
kv: org.apache.spark.rdd.RDD[(String, Double)] = MapPartitionsRDD[17] at map at <console>:29

scala> kv.collect().foreach(println)
(panda,0.5)
(pink,3.5)
(pirate,3.0)
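Readers familiar with the combiner concept from MapReduce may have noticed that reduceByKey() and foldByKey() automatically combine values locally on each machine before computing the global result for each key. The user does not need to specify a combiner. The more general combineByKey() interface lets you customize the combining behavior.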
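As one possible use of combineByKey(), the sketch below computes the same per-key average with an explicit (sum, count) accumulator. The three functions correspond to the createCombiner, mergeValue, and mergeCombiners parameters; the data and variable names are illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: reuse the shell's context if one exists, else start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("pair-rdd-sketch").setMaster("local[*]"))

val rdd = sc.parallelize(
  List(("panda", 0), ("pink", 3), ("pirate", 3), ("panda", 1), ("pink", 4)))

val sumCount = rdd.combineByKey[(Int, Int)](
  // createCombiner: first value seen for a key within a partition.
  (v: Int) => (v, 1),
  // mergeValue: fold another value from the same partition into the accumulator.
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),
  // mergeCombiners: merge accumulators built on different partitions.
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2))

// Turn each (sum, count) pair into an average.
val averages = sumCount.mapValues { case (sum, count) => sum.toDouble / count }
  .collect().toMap   // Map(panda -> 0.5, pink -> 3.5, pirate -> 3.0)
```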
Actions on pair RDDs (example: {(1, 2), (3, 4), (3, 6)})
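A brief sketch of three pair RDD actions on this example data: countByKey(), collectAsMap(), and lookup(). The choice of these three is mine rather than a list from the original, and the variable names are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: reuse the shell's context if one exists, else start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("pair-rdd-sketch").setMaster("local[*]"))

val rdd = sc.parallelize(Seq((1, 2), (3, 4), (3, 6)))

// Number of elements per key, returned to the driver as a local Map.
val counts = rdd.countByKey()          // Map(1 -> 1, 3 -> 2)

// Collect into a local Map; a Map holds one value per key, so duplicate
// keys (key 3 here) collapse to a single entry.
val asMap = rdd.collectAsMap()

// All values stored under one key.
val threes = rdd.lookup(3).sorted      // List(4, 6)
```

All three return local collections to the driver, so they are best suited to results that are known to be small.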