Spark (Part 2)

爱喝水的绿萝

Last modified: 2022-01-22 20:10:33

Category: spark. Tags: spark, big data

First published: 2021-12-13 14:50:18

Article link: https://blog.csdn.net/chaohui2638457321/article/details/121905261

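This post shows how to run a word-count job in the Spark shell and how to write the same job as a Spark program in IDEA. It covers starting spark-shell, reading a file from HDFS, and processing the data with flatMap, map, reduceByKey, and sortBy. It also shows how to create a Spark project in IDEA, run the word count on a local file, and save the results back to HDFS.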

3. Running Spark Jobs

spark-shell

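Spark provides spark-shell, a tool similar to the Scala interpreter (REPL). It connects to a cluster from the command line and lets you submit jobs interactively. The --master option takes the standalone master URL in the form spark://host:port, where 7077 is the default standalone port:

spark-shell --master spark://bd0701:7077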

Writing Spark Programs in the Interpreter (spark-shell)

1. Start the spark-shell script

[root@zhaohui01 bin]# ./spark-shell --master spark://zhaohui01:7077

2. Run a word count in the interpreter

# Read a text file from HDFS
scala> val rdd = sc.textFile("hdfs://zhaohui01:8020/Harry.txt")
rdd: org.apache.spark.rdd.RDD[String] = hdfs://zhaohui01:8020/Harry.txt MapPartitionsRDD[3] at textFile at <console>:24

# Split each line on whitespace; flatMap flattens the resulting String arrays into a single RDD of words
scala> rdd.flatMap(_.split("\\s+"))
res4: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[4] at flatMap at <console>:24

# Normalize each word, pair it with 1, then group by key and sum the values
scala> rdd.flatMap(_.split("\\s+")).map(_.toLowerCase.replaceAll("\\W","")).map((_,1)).reduceByKey(_+_)
res6: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[8] at reduceByKey at <console>:26

# Sort by value in descending order
scala> rdd.flatMap(_.split("\\s+")).map(_.toLowerCase.replaceAll("\\W","")).map((_,1)).reduceByKey(_+_).sortBy(_._2,false).collect

res7: Array[(String, Int)] = Array(("",17416), (the,11768), (to,6516), (and,6206), (of,5333), (a,4858), (he,4640), (said,3917), (harry,3758), (was,3615), (his,3254), (in,3093), (you,3026), (it,2728), (i,2390), (had,2332), (that,2225), (at,2149), (on,1932), (as,1820), (her,1781), (him,1615), (with,1591), (not,1561), (she,1428), (but,1381), (for,1330), (they,1319), (hermione,1223), (ron,1191), (were,1143), (from,1080), (what,1079), (be,1062), (all,1036), (up,1014), (out,989), (them,985), (have,965), (so,815), (been,783), (back,783), (we,775), (this,762), (there,733), (is,719), (well,716), (into,694), (an,690), (who,679), (now,659), (just,644), (if,639), (could,627), (no,619), (me,612), (when,605), (about,600), (their,599), (sirius,590), (professor,585), (did,585)...
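
The first entry of the result, ("",17416), is the empty string rather than a word. It appears because replaceAll("\\W","") reduces punctuation-only tokens to "", and split("\\s+") yields a leading empty string for lines that start with whitespace. The following is a minimal sketch that reproduces the effect on plain Scala collections instead of an RDD. The sample lines and the names WordCountSketch and countWords are hypothetical and not part of the original post.

```scala
object WordCountSketch {
  // Same steps as the spark-shell pipeline, applied to an in-memory Seq instead of an RDD.
  def countWords(lines: Seq[String], dropEmpty: Boolean): Seq[(String, Int)] = {
    val words = lines
      .flatMap(_.split("\\s+"))                  // one element per whitespace-separated token
      .map(_.toLowerCase.replaceAll("\\W", ""))  // lowercase, strip non-word characters
    val kept = if (dropEmpty) words.filter(_.nonEmpty) else words
    kept
      .groupBy(identity)                         // stand-in for reduceByKey(_ + _)
      .map { case (word, occurrences) => (word, occurrences.size) }
      .toSeq
      .sortBy { case (word, count) => (-count, word) } // descending count, ties by word
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample: leading whitespace and stand-alone punctuation
    val sample = Seq("  Harry said - nothing .", "harry, Harry!")
    println(countWords(sample, dropEmpty = false))
    println(countWords(sample, dropEmpty = true))
  }
}
```

In the Spark pipeline the equivalent change would be to add .filter(_.nonEmpty) after the normalizing map and before .map((_,1)).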

Writing Spark Programs in IDEA

1. Word count

package com.zch.spark.core.exercise

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Author: zhaoHui
 * Date: 2021/12/08
 * Time: 13:52
 * Description: 
 */
object Exercise_SparkCoreDemo01_WordCount {

  val path = "F:\\JAVA\\bigdata2107\\zch\\spark\\src\\main\\resources\\Harry.txt"

  def main(args: Array[String]): Unit = {
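    // 1. Create the Spark configuration object
    val conf = new SparkConf()
    // 1.1 Set the master address of the cluster the job is submitted to,
    //     e.g. "spark://zhaohui01:7077".
    //     For local testing, a local master can be used instead:
    //     local[n]  n is the number of CPU cores allocated to the job
    //     local[*]  allocates all local CPU cores to the job
    conf.setMaster("local[1]")
    // 1.2 Set the application name
    conf.setAppName("demo1")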

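    // 2. Create the SparkContext
    val sc = new SparkContext(conf)
    // 3. Load the text file into an RDD.
    //    Spark splits the text on \n by default,
    //    so each line becomes one element.
    val rdd: RDD[String] = sc.textFile(path)
    // Transformation: RDD => RDD (split each line into words)
    val rdd1: RDD[String] = rdd.flatMap(_.split("\\s+"))
    // Transformation: RDD => RDD (lowercase, strip non-word characters)
    val rdd2: RDD[String] = rdd1.map(_.toLowerCase.replaceAll("\\W", ""))
    // Transformation: RDD => RDD (pair each word with a count of 1)
    val rdd3: RDD[(String, Int)] = rdd2.map(x => (x, 1))
    // Transformation: RDD => RDD
    //    (w1,1)(w1,1)(w1,1) is grouped as w1 -> [1,1,1]
    //    and reduced with (v1, v2) => v1 + v2
    val rdd4: RDD[(String, Int)] = rdd3.reduceByKey(_ + _)
    // Transformation: RDD => RDD (sort by count, descending)
    val rdd5: RDD[(String, Int)] = rdd4.sortBy(_._2, ascending = false)
    // Action: RDD => Array[]
    // val array = rdd5.collect()
    // println(array.toList)
    // rdd5.foreach(println)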

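    // The same pipeline written as a single chain
    sc.textFile(path)
      .flatMap(_.split("\\s+"))
      // Number of partitions (and therefore parallel tasks) used to process the RDD
      .repartition(6)
      .map(x => {
        (x.toLowerCase
            .replaceAll("\\W",""),1)
      })
      // Number of partitions for the aggregated result
      .reduceByKey(_+_,3)
      .sortBy(_._2,false)
      .foreach(println)

    // Release the resources held by the SparkContext
    sc.stop()
  }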
}
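
The numeric arguments in repartition(6) and reduceByKey(_+_,3) are partition counts. For reduceByKey, Spark's default hash partitioner assigns each key to a partition by taking the key's hash code modulo the number of partitions (kept non-negative), so every pair with the same key lands in the same partition and can be summed there. The sketch below imitates that idea with plain Scala collections. It is an illustration under that assumption, not Spark's own code, and the names PartitionSketch, partitionFor, and reduceByKeySketch are hypothetical.

```scala
object PartitionSketch {
  // Non-negative modulo, so negative hash codes still map to 0 until numPartitions - 1.
  def partitionFor(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Route each pair to a partition by key, then sum the counts per key inside each partition.
  def reduceByKeySketch(pairs: Seq[(String, Int)], numPartitions: Int): Map[Int, Map[String, Int]] =
    pairs
      .groupBy { case (word, _) => partitionFor(word, numPartitions) }
      .map { case (partition, routed) =>
        val sums = routed
          .groupBy(_._1)
          .map { case (word, group) => (word, group.map(_._2).sum) }
        (partition, sums)
      }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample pairs, as produced by map(x => (x, 1))
    val pairs = Seq("harry" -> 1, "ron" -> 1, "harry" -> 1, "hermione" -> 1, "ron" -> 1)
    reduceByKeySketch(pairs, numPartitions = 3).toSeq.sortBy(_._1).foreach {
      case (partition, sums) => println(s"partition $partition: $sums")
    }
  }
}
```

Because a key never spans two partitions after this step, each partition's sums are already final. That is why a small partition count such as 3 is enough for a modest vocabulary.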

1.1 Read a file from the HDFS cluster and save the result back to the cluster

1.1.1 Write the Scala code

package com.zch.spark.core.exercise

import org.apache.spark.{SparkConf, SparkContext}

/**
 * Author: zhaoHui
 * Date: 2021/12/08
 * Time: 19:06
 * Description: 
 */
object Spark_FromHDFSReadFile {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("spark://zhaohui01:7077")
      .setAppName("wordCount")

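    val sc = new SparkContext(conf)
    sc.textFile("hdfs://zhaohui01:8020/Harry.txt")
      .flatMap(_.split("\\s+"))
      // Redistribute the words across 6 partitions
      .repartition(6)
      .map(x => {
        // Lowercase the word and strip non-word characters
        val w = x.toLowerCase()
          .replaceAll("\\W", "")
        (w, 1)
      })
      // Sum the counts per word into a single partition
      .reduceByKey(_ + _, 1)
      .sortBy(_._2, false)
      // Write the sorted result to an HDFS directory
      .saveAsTextFile("hdfs://zhaohui01:8020/WordCount")

    // Release the resources held by the SparkContext
    sc.stop()
  }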
}
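
saveAsTextFile writes a directory rather than a single file, and each line of its part files is the toString of one RDD element, so a (String, Int) pair is stored as text such as (the,11768). If the saved counts need to be read back later, the lines have to be parsed again. The sketch below shows one way to do that in plain Scala. The names OutputLineSketch and parseLine and the sample lines are hypothetical and not part of the original post.

```scala
object OutputLineSketch {
  // Parse one output line such as "(the,11768)" back into (word, count).
  // Splits on the last comma, so the parse does not depend on the word's content.
  def parseLine(line: String): Option[(String, Int)] = {
    val trimmed = line.trim
    if (!trimmed.startsWith("(") || !trimmed.endsWith(")")) None
    else {
      val inner = trimmed.substring(1, trimmed.length - 1)
      val comma = inner.lastIndexOf(',')
      if (comma < 0) None
      else scala.util.Try(inner.substring(comma + 1).toInt).toOption
        .map(count => (inner.substring(0, comma), count))
    }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical lines in the format a (String, Int) tuple prints as
    val lines = Seq("(,17416)", "(the,11768)", "(to,6516)")
    lines.flatMap(parseLine(_)).foreach(println)
  }
}
```

Note that the empty-string key is written as (,17416), which is one more reason to filter empty tokens out before counting.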

1.1.2 Build the JAR