Getting Started with Spark: Word Count with RDDs in spark-shell

Submit a Spark job from the command line:
./bin/spark-submit --class org.apache.spark.examples.SparkPi ./examples/jars/spark-examples_2.11-2.1.0.jar 10000
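Here --class names the application's entry point, the jar is the examples package shipped with Spark, and the trailing 10000 is the application's own argument (for SparkPi, the number of slices to sample over).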


Run ./bin/spark-shell to enter the Spark shell.


RDD (Resilient Distributed Dataset): a collection of data, an abstraction that maps onto datasets distributed across the cluster.
In spark-shell, an RDD is created with ordinary Scala syntax:
scala> sc.textFile("/root/hello.txt")
res1: org.apache.spark.rdd.RDD[String] = /root/hello.txt MapPartitionsRDD[1] at textFile at <console>:25
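Note that textFile is a lazy transformation: it only records how to read the file and does not load any data until an action (such as collect or foreach) is triggered.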


Read the file so that each line becomes a separate record:
scala> val lineRDD = sc.textFile("/root/hello.txt")
lineRDD: org.apache.spark.rdd.RDD[String] = /root/hello.txt MapPartitionsRDD[3] at textFile at <console>:24


scala> lineRDD.foreach(println)
hello java
hello scala
hello c
hello python
hello shell


scala> lineRDD.collect
res4: Array[String] = Array(hello java, hello scala, hello c, hello python, hello shell, "")
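The trailing "" in the array comes from a blank line at the end of hello.txt. Also note that collect pulls the entire RDD back to the driver as a local Array, so it should only be used on small datasets.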


Map over each line (each record), splitting the string into an array of words:
scala> val wordRDD = lineRDD.map(line => line.split(" "))
wordRDD: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[6] at map at <console>:26


scala> wordRDD.collect
res5: Array[Array[String]] = Array(Array(hello, java), Array(hello, scala), Array(hello, c), Array(hello, python), Array(hello, shell), Array(""))
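Note that wordRDD here is an RDD[Array[String]]: map produces one output element per input line, so each record is still a whole array of words. To flatten these arrays into a single RDD of words, use flatMap instead, as shown next.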


Split each line into individual words and flatten them into a single collection:
scala> val wordRDD = lineRDD.flatMap(line => line.split(" "))
wordRDD: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at flatMap at <console>:26


scala> wordRDD.collect
res6: Array[String] = Array(hello, java, hello, scala, hello, c, hello, python, hello, shell, "")
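If the empty string should not be counted, it can be filtered out before building the pairs. A minimal sketch (cleanWordRDD is a name introduced here, not part of the original session):

scala> val cleanWordRDD = wordRDD.filter(word => word.nonEmpty)

The rest of the pipeline works the same way on cleanWordRDD.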


Turn each word into a key-value pair of the form (word, 1):
scala> val wordCountRDD = wordRDD.map(word => (word,1))
wordCountRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[8] at map at <console>:28


scala> wordCountRDD.collect
res7: Array[(String, Int)] = Array((hello,1), (java,1), (hello,1), (scala,1), (hello,1), (c,1), (hello,1), (python,1), (hello,1), (shell,1), ("",1))
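Because wordCountRDD now contains (key, value) pairs, Spark makes key-based operations such as reduceByKey available on it (via the implicit conversion to PairRDDFunctions).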


Run the reduce, merging the values in wordCountRDD by key:
scala> val resultRDD = wordCountRDD.reduceByKey((x, y) => x + y)
resultRDD: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[9] at reduceByKey at <console>:30


scala> resultRDD.collect
res8: Array[(String, Int)] = Array((scala,1), (python,1), ("",1), (hello,5), (java,1), (shell,1), (c,1))
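The reduce function can also be written with Scala's placeholder syntax, which is equivalent:

scala> val resultRDD = wordCountRDD.reduceByKey(_ + _)

Unlike groupByKey, reduceByKey combines the values for each key within each partition before shuffling, so less data moves across the network.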


Sort the result by key:
scala> val orderedRDD = resultRDD.sortByKey()
orderedRDD: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[10] at sortByKey at <console>:32


scala> orderedRDD.collect
res10: Array[(String, Int)] = Array(("",1), (c,1), (hello,5), (java,1), (python,1), (scala,1), (shell,1))
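The whole count can also be expressed as a single chained pipeline. A minimal sketch, assuming the same /root/hello.txt as above (the trailing dots let the REPL continue the expression across lines):

scala> sc.textFile("/root/hello.txt").
     |   flatMap(line => line.split(" ")).
     |   map(word => (word, 1)).
     |   reduceByKey(_ + _).
     |   sortByKey().
     |   collect

To order by count instead of by word, sortBy on the pair's value works; a sketch:

scala> resultRDD.sortBy(pair => pair._2, ascending = false).collect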