Getting Started with Spark: Word Count with RDDs in spark-shell

Submit a Spark job from the command line:
./bin/spark-submit --class org.apache.spark.examples.SparkPi ./examples/jars/spark-examples_2.11-2.1.0.jar 10000
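Here --class names the application's entry point, the jar is the examples package shipped with Spark, and the trailing 10000 is the application's own argument (for SparkPi, the number of slices to sample over).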


Run ./bin/spark-shell to enter the Spark shell.


RDD (Resilient Distributed Dataset): a collection of data, an abstraction that maps onto datasets distributed across the cluster.
In spark-shell, an RDD is created with ordinary Scala syntax:
scala> sc.textFile("/root/hello.txt")
res1: org.apache.spark.rdd.RDD[String] = /root/hello.txt MapPartitionsRDD[1] at textFile at <console>:25
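Note that textFile is a lazy transformation: it only records how to read the file and does not load any data until an action (such as collect or foreach) is triggered.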


Read the file so that each line becomes a separate record:
scala> val lineRDD = sc.textFile("/root/hello.txt")
lineRDD: org.apache.spark.rdd.RDD[String] = /root/hello.txt MapPartitionsRDD[3] at textFile at <console>:24


scala> lineRDD.foreach(println)
hello java
hello scala
hello c
hello python
hello shell


scala> lineRDD.collect
res4: Array[String] = Array(hello java, hello scala, hello c, hello python, hello shell, "")
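The trailing "" in the array comes from a blank line at the end of hello.txt. Also note that collect pulls the entire RDD back to the driver as a local Array, so it should only be used on small datasets.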


Map over each line (each record), splitting the string into an array of words:
scala> val wordRDD = lineRDD.map(line => line.split(" "))
wordRDD: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[6] at map at <console>:26


scala> wordRDD.collect
res5: Array[Array[String]] = Array(Array(hello, java), Array(hello, scala), Array(hello, c), Array(hello, python), Array(hello, shell), Array(""))
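Note that wordRDD here is an RDD[Array[String]]: map produces one output element per input line, so each record is still a whole array of words. To flatten these arrays into a single RDD of words, use flatMap instead, as shown next.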


Split each line into individual words and flatten them into a single collection:
scala> val wordRDD = lineRDD.flatMap(line => line.split(" "))
wordRDD: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at flatMap at <console>:26


scala> wordRDD.collect
res6: Array[String] = Array(hello, java, hello, scala, hello, c, hello, python, hello, shell, "")
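If the empty string should not be counted, it can be filtered out before building the pairs. A minimal sketch (cleanWordRDD is a name introduced here, not part of the original session):

scala> val cleanWordRDD = wordRDD.filter(word => word.nonEmpty)

The rest of the pipeline works the same way on cleanWordRDD.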


Turn each word into a key-value pair of the form (word, 1):
scala> val wordCountRDD = wordRDD.map(word => (word,1))
wordCountRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[8] at map at <console>:28


scala> wordCountRDD.collect
res7: Array[(String, Int)] = Array((hello,1), (java,1), (hello,1), (scala,1), (hello,1), (c,1), (hello,1), (python,1), (hello,1), (shell,1), ("",1))
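Because wordCountRDD now contains (key, value) pairs, Spark makes key-based operations such as reduceByKey available on it (via the implicit conversion to PairRDDFunctions).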


Run the reduce, merging the values in wordCountRDD by key:
scala> val resultRDD = wordCountRDD.reduceByKey((x, y) => x + y)
resultRDD: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[9] at reduceByKey at <console>:30


scala> resultRDD.collect
res8: Array[(String, Int)] = Array((scala,1), (python,1), ("",1), (hello,5), (java,1), (shell,1), (c,1))
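The reduce function can also be written with Scala's placeholder syntax, which is equivalent:

scala> val resultRDD = wordCountRDD.reduceByKey(_ + _)

Unlike groupByKey, reduceByKey combines the values for each key within each partition before shuffling, so less data moves across the network.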


Sort the result by key:
scala> val orderedRDD = resultRDD.sortByKey()
orderedRDD: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[10] at sortByKey at <console>:32


scala> orderedRDD.collect
res10: Array[(String, Int)] = Array(("",1), (c,1), (hello,5), (java,1), (python,1), (scala,1), (shell,1))
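The whole count can also be expressed as a single chained pipeline. A minimal sketch, assuming the same /root/hello.txt as above (the trailing dots let the REPL continue the expression across lines):

scala> sc.textFile("/root/hello.txt").
     |   flatMap(line => line.split(" ")).
     |   map(word => (word, 1)).
     |   reduceByKey(_ + _).
     |   sortByKey().
     |   collect

To order by count instead of by word, sortBy on the pair's value works; a sketch:

scala> resultRDD.sortBy(pair => pair._2, ascending = false).collect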