3. Running Spark Jobs
spark-shell
- Spark ships with spark-shell, a Scala-REPL-like tool that connects to the cluster directly from the command line and submits jobs for execution.
spark-shell --master spark://bd0701
Writing Spark Programs in the Interpreter
1. Launch the spark-shell script
[root@zhaohui01 bin]# ./spark-shell --master spark://zhaohui01:7077
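Note that spark-shell pre-creates a SparkContext and exposes it as the variable sc (Spark 2.x shells also expose a SparkSession as spark), which the examples below use directly.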
2. Word count in the interpreter
```
# Read the text file from HDFS
scala> val rdd = sc.textFile("hdfs://zhaohui01:8020/Harry.txt")
rdd: org.apache.spark.rdd.RDD[String] = hdfs://zhaohui01:8020/Harry.txt MapPartitionsRDD[3] at textFile at <console>:24

# Split each line on whitespace into String words
scala> rdd.flatMap(_.split("\\s+"))
res4: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[4] at flatMap at <console>:24

# Group identical keys together and sum their values
scala> rdd.flatMap(_.split("\\s+")).map(_.toLowerCase.replaceAll("\\W","")).map((_,1)).reduceByKey(_+_)
res6: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[8] at reduceByKey at <console>:26

# Sort by value in descending order
scala> rdd.flatMap(_.split("\\s+")).map(_.toLowerCase.replaceAll("\\W","")).map((_,1)).reduceByKey(_+_).sortBy(_._2,false).collect
res7: Array[(String, Int)] = Array(("",17416), (the,11768), (to,6516), (and,6206), (of,5333), (a,4858), (he,4640), (said,3917), (harry,3758), (was,3615), (his,3254), (in,3093), (you,3026), (it,2728), (i,2390), (had,2332), (that,2225), (at,2149), (on,1932), (as,1820), (her,1781), (him,1615), (with,1591), (not,1561), (she,1428), (but,1381), (for,1330), (they,1319), (hermione,1223), (ron,1191), (were,1143), (from,1080), (what,1079), (be,1062), (all,1036), (up,1014), (out,989), (them,985), (have,965), (so,815), (been,783), (back,783), (we,775), (this,762), (there,733), (is,719), (well,716), (into,694), (an,690), (who,679), (now,659), (just,644), (if,639), (could,627), (no,619), (me,612), (when,605), (about,600), (their,599), (sirius,590), (professor,585), (did,585)...
```
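Note that the most frequent "word" is the empty string ("",17416): tokens made up entirely of punctuation are reduced to nothing by replaceAll("\\W","") but still counted. A minimal variant that drops them before counting (the filter(_.nonEmpty) step is an addition, not part of the original session):

```
scala> rdd.flatMap(_.split("\\s+")).map(_.toLowerCase.replaceAll("\\W","")).filter(_.nonEmpty).map((_,1)).reduceByKey(_+_).sortBy(_._2,false).collect
```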
Writing Spark Programs in IDEA
1. Word count
```scala
package com.zch.spark.core.exercise

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Author: zhaoHui
 * Date: 2021/12/08
 * Time: 13:52
 * Description:
 */
object Exercise_SparkCoreDemo01_WordCount {

  val path = "F:\\JAVA\\bigdata2107\\zch\\spark\\src\\main\\resources\\Harry.txt"

  def main(args: Array[String]): Unit = {
    // 1. Create the Spark configuration object
    val conf = new SparkConf()
    // 1.1 Set the master address the job is submitted to,
    //     e.g. "spark://zhaohui01:7077".
    //     For local testing, a local master can be used instead:
    //     local[n] - n is the number of CPU cores allocated to this job
    //     local[*] - allocate all local CPU cores to this job
    conf.setMaster("local[1]")
    // 1.2 Set the application name
    conf.setAppName("demo1")

    // 2. Create the SparkContext
    val sc = new SparkContext(conf)

    // 3. Load the text file into an RDD.
    //    By default Spark splits the text on \n: one line per element.
    val rdd: RDD[String] = sc.textFile(path)
    // Each step maps an RDD to a new RDD
    val rdd1: RDD[String] = rdd.flatMap(_.split("\\s+"))
    val rdd2: RDD[String] = rdd1.map(_.toLowerCase.replaceAll("\\W", ""))
    val rdd3: RDD[(String, Int)] = rdd2.map(x => (x, 1))
    // (w1,1)(w1,1) => [1,1,1]; (v1,v2) => v1 + v2
    val rdd4: RDD[(String, Int)] = rdd3.reduceByKey(_ + _)
    val rdd5: RDD[(String, Int)] = rdd4.sortBy(_._2, ascending = false)
    // RDD => Array[]
    // val array = rdd5.collect()
    // println(array.toList)
    // rdd5.foreach(println)

    // The same pipeline written as one chain:
    sc.textFile(path)
      .flatMap(_.split("\\s+"))
      // number of partitions used to process the RDD
      .repartition(6)
      .map(x => (x.toLowerCase.replaceAll("\\W", ""), 1))
      // second argument: number of result partitions after the shuffle
      .reduceByKey(_ + _, 3)
      .sortBy(_._2, false)
      .foreach(println)
  }
}
```
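The arguments to repartition(6) and reduceByKey(_ + _, 3) control partitioning rather than threads as such: each partition becomes one task. A short sketch to verify how the partition count changes along the pipeline (getNumPartitions is a standard RDD method; sc and path are those from the program above):

```scala
// Sketch: inspect partition counts at each stage of the pipeline.
val words = sc.textFile(path).flatMap(_.split("\\s+"))
println(words.getNumPartitions)                                    // default from textFile
println(words.repartition(6).getNumPartitions)                     // 6 after repartition
println(words.map((_, 1)).reduceByKey(_ + _, 3).getNumPartitions)  // 3 after reduceByKey
```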
1.1 Read a file from HDFS and save the result back to the cluster
1.1.1 Write the Scala code
```scala
package com.zch.spark.core.exercise

import org.apache.spark.{SparkConf, SparkContext}

/**
 * Author: zhaoHui
 * Date: 2021/12/08
 * Time: 19:06
 * Description:
 */
object Spark_FromHDFSReadFile {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("spark://zhaohui01:7077")
      .setAppName("wordCount")
    val sc = new SparkContext(conf)

    sc.textFile("hdfs://zhaohui01:8020/Harry.txt")
      .flatMap(_.split("\\s+"))
      .repartition(6)
      .map(x => {
        val w = x.toLowerCase()
          .replaceAll("\\W", "")
        (w, 1)
      })
      .reduceByKey(_ + _, 1)
      .sortBy(_._2, false)
      .saveAsTextFile("hdfs://zhaohui01:8020/WordCount")
  }
}
```
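One caveat: saveAsTextFile writes a directory and fails if the target path already exists, so a re-run needs a fresh path or an explicit delete first. A minimal sketch using the Hadoop FileSystem API (the cleanup step is an addition, not part of the original program):

```scala
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// Remove a previous run's output before saving again (recursive delete).
val fs = FileSystem.get(new URI("hdfs://zhaohui01:8020"), sc.hadoopConfiguration)
fs.delete(new Path("/WordCount"), true)
```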
1.1.2 Build the jar
1.1.3 Upload the jar to the cluster
1.1.4 Run the jar
Note:
- If the jar fails to build, clean the project in IDEA, run the project once, and then rebuild the jar.
- --master xxxx is the address of the Spark master host.
- --class is the fully qualified reference of the main class (Copy Reference in IDEA), followed by the path to the jar on the cluster; see the submission sketch below.
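Putting the two flags together, a submission command might look like the following sketch (the jar path /root/spark-demo.jar is hypothetical; substitute wherever you uploaded the jar):

```
[root@zhaohui01 bin]# ./spark-submit \
  --master spark://zhaohui01:7077 \
  --class com.zch.spark.core.exercise.Spark_FromHDFSReadFile \
  /root/spark-demo.jar
```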