Big Data in Practice, Lesson 16 (Part 2) - Spark-Core04

Chapter 1: Spark Monitoring Overview

Chapter 2: Other Monitoring Approaches

Chapter 3: Shared Variables

1. Spark Monitoring Overview

  1. Start spark-shell
  2. Run sc.parallelize(List(1,2,3,4)).count
  3. Check the web UI: there is one stage with 4 tasks; pay attention to the launch time, the duration, and the GC time

Scenario: we were running locally and exited the Spark UI; if a Spark job runs in the middle of the night, then whether it finishes or fails, once it ends all of that information is gone.

This leads to the Spark monitoring concepts:
http://spark.apache.org/docs/latest/monitoring.html

Every SparkContext launches a Web UI, by default on port 4040, that displays useful information about the application. This includes:

  • A list of scheduler stages and tasks // shows stage and task information
  • A summary of RDD sizes and memory usage // lists RDD sizes and memory usage
  • Environment information // environment-related information
  • Information about the running executors // executor information
  1. You can access this interface by simply opening http://<driver-node>:4040 in a web browser. If multiple SparkContexts are running on the same host, they will bind to successive ports beginning with 4040 (4041, 4042, etc.).
  • You can open this interface on port 4040 in a browser (the startup log prints the URL); if multiple SparkContexts run on the same host, the ports increase one by one.
  2. Note that this information is only available for the duration of the application by default.
  • This information is only accessible during the application's lifetime; once spark-shell exits, it can no longer be reached.
  3. To view the web UI after the fact, set spark.eventLog.enabled to true before starting the application. This configures Spark to log Spark events that encode the information displayed in the UI to persisted storage.
  • If you want to view the UI afterwards, set spark.eventLog.enabled to true before starting the application; Spark will then record these events to persistent storage.

The default behavior cannot meet business needs on its own, which is why the monitoring features below exist.

1.1 Spark History Server

  1. It is still possible to construct the UI of an application through Spark's history server, provided that the application's event logs exist. You can start the history server by executing:
  • The UI of a finished application can be rebuilt through the Spark history server, as long as the application's event logs exist.

  • From $SPARK_HOME, run: ./sbin/start-history-server.sh

  2. This creates a web interface at http://<server-url>:18080 by default, listing incomplete and completed applications and attempts.
  • It lists completed, incomplete, and retried applications, by default on (ip/hostname):18080.
  3. When using the file-system provider class (see spark.history.provider below), the base logging directory must be supplied in the spark.history.fs.logDirectory configuration option, and should contain sub-directories that each represents an application's event log.
  • When using the file-system provider class (see spark.history.provider), the base logging directory must be supplied via spark.history.fs.logDirectory, and it should contain sub-directories, each of which represents one application's event log.
  4. The Spark jobs themselves must be configured to log events, and to log them to the same shared, writable directory. For example, if the server was configured with a log directory of hdfs://namenode/shared/spark-logs, then the client-side options would be:

    Step 1: spark.eventLog.enabled true
    Step 2: spark.eventLog.dir hdfs://namenode/shared/spark-logs // once enabled, point it at the HDFS storage directory
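The same two client-side settings can also be applied in code through SparkConf; a minimal sketch, assuming the example HDFS path from the docs above (in practice these usually live in spark-defaults.conf):

import org.apache.spark.{SparkConf, SparkContext}

object EventLogDemo {
  def main(args: Array[String]): Unit = {
    // Sketch only: enable event logging so this run shows up in the history server.
    val conf = new SparkConf()
      .setAppName("EventLogDemo")
      .setMaster("local[2]")
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "hdfs://namenode/shared/spark-logs")

    val sc = new SparkContext(conf)
    sc.parallelize(List(1, 2, 3, 4)).count()
    sc.stop()   // stopping the context closes the event log cleanly
  }
}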

Environment Variables

  1. SPARK_HISTORY_OPTS: spark.history.* configuration options for the history server (default: none).
  • Any option starting with spark.history needs to be configured through SPARK_HISTORY_OPTS.

1.2 Configuring under $SPARK_HOME

Spark History Server Configuration Options

Property name                        | Default
spark.history.provider               | org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory        | file:/tmp/spark-events
spark.history.fs.update.interval     | 10s
spark.history.retainedApplications   | 50
spark.history.fs.cleaner.enabled     | false
spark.history.fs.cleaner.interval    | 1d
spark.history.fs.cleaner.maxAge      | 7d

Configure under $SPARK_HOME/conf:

  1. cd $SPARK_HOME/conf, copy the template: cp spark-defaults.conf.template spark-defaults.conf; then edit it: vi spark-defaults.conf

  2. cp spark-env.sh.template spark-env.sh; then edit it and set SPARK_HISTORY_OPTS:

  • SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://hadoop002:9000/g6_directory"
  3. ./start-history-server.sh; before starting, make sure the /g6_directory log directory exists on HDFS
  • Check the logs printed under $SPARK_HOME/logs to verify that it started correctly.
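Other spark.history.* options from the table above can be passed the same way, as additional -D flags inside SPARK_HISTORY_OPTS in spark-env.sh; a sketch with illustrative values (only the log directory is required for this walkthrough):

SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://hadoop002:9000/g6_directory -Dspark.history.retainedApplications=50 -Dspark.history.fs.cleaner.enabled=true"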

Visit hadoop002:18080.

No completed applications found! You will also see hints like the following:

  1. Did you specify the correct logging directory? Please verify your setting of spark.history.fs.logDirectory listed above and whether you have the permissions to access it.
  • It asks you to check whether the directory you specified is correct and whether you have permission to access it.
  2. It is also possible that your application did not run to completion or did not stop the SparkContext.
  • Your application did not finish, or its SparkContext was never stopped.

1.3 Testing locally with spark-shell

  1. Start spark-shell locally, run sc.parallelize(List(1,2,3,4)).count, then exit the current SparkContext
  2. Check hadoop002:18080 to see whether the run shows up; because we ran locally, the App ID shown starts with local
  3. All of the run information is there, and the pages look the same as what hadoop002:4040 showed while the job was running


Notes:

  • Note that in all of these UIs, the tables are sortable by clicking their headers, making it easy to identify slow tasks, data skew, etc.
  • On port 18080 you can sort a table by clicking its header, which makes it easy to spot data skew; go into the tasks table and click Duration.

Note:

  1. The history server displays both completed and incomplete Spark jobs. If an application makes multiple attempts after failures, the failed attempts will be displayed, as well as any ongoing incomplete attempt or the final successful attempt.
  • The history server shows both completed and incomplete Spark jobs.
  2. Incomplete applications are only updated intermittently. The time between updates is defined by the interval between checks for changed files (spark.history.fs.update.interval). On large clusters, the update interval may be set to large values. The way to view a running application is actually to view its own web UI.
  • Incomplete applications are only refreshed periodically, controlled by the update interval.
  3. One way to signal the completion of a Spark job is to stop the SparkContext explicitly (sc.stop()), or in Python use the with SparkContext() as sc: construct to handle SparkContext setup and teardown.
  • Stop the Spark job with sc.stop() so that it is marked as complete.
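A minimal Scala sketch of the same idea (the object name is made up for illustration): stop the SparkContext in a finally block, so the application is marked complete even if the job throws.

import org.apache.spark.{SparkConf, SparkContext}

object StopDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("StopDemo").setMaster("local[2]"))
    try {
      sc.parallelize(List(1, 2, 3, 4)).count()
    } finally {
      sc.stop()   // always stop, so the event log is closed and the app shows up as completed
    }
  }
}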

2.1 The REST API approach

  1. In addition to viewing the metrics in the UI, they are also available as JSON. This gives developers an easy way to create new visualizations and monitoring tools for Spark. The JSON is available both for running applications and in the history server. The endpoints are mounted at /api/v1. For the history server, they would typically be accessible at http://<server-url>:18080/api/v1, and for a running application, at http://localhost:4040/api/v1.
  • You can build your own Spark monitoring tool on top of this; the JSON is exposed by both running applications and the history server. For the history server use http://<server-url>:18080/api/v1, and for a running application use http://localhost:4040/api/v1.
  2. In the API, an application is referenced by its application ID, [app-id]. When running on YARN, each application may have multiple attempts, but there are attempt IDs only for applications in cluster mode, not applications in client mode. Applications in YARN cluster mode can be identified by their [attempt-id]. In the API listed below, when running in YARN cluster mode, [app-id] will actually be [base-app-id], where [base-app-id] is the YARN application ID.
  • An application is referenced by its application ID; with Spark on YARN, each application may have multiple attempts, but attempt IDs only exist in cluster mode, not in client mode.

You can also look at these directly in the UI:

2.2 Using the REST API

http://hadoop002:18080/api/v1/applications returns a JSON document; if there are several applications, you get several JSON entries.

  1. hadoop002:18080/api/v1 shows nothing on its own

  2. hadoop002:18080/api/v1/applications returns a JSON array containing all applications

  3. hadoop002:18080/api/v1/applications?status=[completed|running] adds a status filter, e.g. to check whether any applications are currently running

  4. hadoop002:18080/api/v1/applications/[app-id]/jobs lists the jobs of an application

  5. /applications/[app-id]/jobs/[job-id] gives the details for the given job

Typical usage: stand the service up, work with the front-end team (who design the UI), and hand them these endpoints.
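A minimal Scala sketch of calling the applications endpoint from code (the host and port are the ones used in this walkthrough; the raw JSON string is just printed, and parsing is left out):

import scala.io.Source

object RestApiDemo {
  def main(args: Array[String]): Unit = {
    // History server endpoint from this walkthrough; adjust host/port for your cluster.
    val url = "http://hadoop002:18080/api/v1/applications?status=completed"
    val source = Source.fromURL(url)
    try {
      println(source.mkString)   // raw JSON array of completed applications
    } finally {
      source.close()
    }
  }
}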

Executor Task Metrics

Metrics: rarely needed in practice.

Summary: the main things to focus on are the HistoryServer and the REST API.

  1. jps shows that the HistoryServer is just a Java process; you can also check with ps -ef | grep <port number>

  2. When the HistoryServer is no longer needed, stop it with: ./stop-history-server.sh

  3. Where the recorded logs live: hdfs dfs -ls hdfs://hadoop002:9000/g6_directory

  • hdfs dfs -text hdfs://hadoop002:9000/g6_directory/app-id
  • That content is JSON; the JSON we see through the REST API is parsed from exactly this data.

Chapter 3: Shared Variables

Definition:

  • Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program. Supporting general, read-write shared variables across tasks would be inefficient. However, Spark does provide two limited types of shared variables: broadcast variables and accumulators.

  • When an operator such as map or reduce runs on an executor, every variable referenced in the function is copied to each machine; by default, sharing ordinary read-write variables across tasks this way is inefficient.

val values = new scala.collection.mutable.HashMap[String, String]()   // a variable defined on the driver
val rdd = sc.parallelize(Seq("a", "b", "c"))                           // some RDD (illustrative)
rdd.foreach { x =>
  values.get(x)   // the operator uses a piece of data defined outside of it
}
  • If an operator directly uses an external variable like this, Spark ships a copy of that plain variable to every task; this not only consumes a lot of memory, it also raises the problem of how concurrent updates to the variable could all take effect without conflicting. For this, Spark introduces two features: broadcast variables and accumulators.

3.1 Broadcast Variables

Definition:
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.

  • A broadcast variable keeps one copy per machine, rather than one copy per task.
  • Suppose the value is 10 MB and there are 1000 tasks: with the normal approach, an operator that uses an external variable has to copy it to every task,
    which is roughly 10 GB and wastes a lot of memory.

This is what motivates broadcast variables ==> one copy per machine instead of one copy per task.

Testing broadcast variables in spark-shell:

  1. val broadcastVar = sc.broadcast(Array(1,2,3,4))
  2. broadcastVar.value
  • This toy usage is not how it is applied in production.

3.1.1 Plain join

Scenario 1:

info1          | info2
G601 阿呆      | G601 南京大学
G602 君永夜    | G602 苏州大学
G622 血狼      | G638 三江学院

Operation: info1.join(info2)

package Sparkcore04

import org.apache.spark.{SparkConf, SparkContext}

object AccumulatorApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("AccumulatorApp").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)

    commonJoin(sc)
    sc.stop()
  }

  def commonJoin(sc: SparkContext): Unit = {
    val info1 = sc.parallelize(Array(("G601", "阿呆"), ("G602", "君永夜"), ("G622", "血狼")))

    val info2 = sc.parallelize(Array(("G601", "南京大学"), ("G602", "苏州大学"), ("622", "三江学院")))

    // both are (key, value) pair RDDs, so we can join them directly
    info1.join(info2).foreach(println)
  }
}

Output:
(G601,(阿呆,南京大学))
(G602,(君永夜,苏州大学))

Scenario 2:

info1          | info2
G601 阿呆      | G601 南京大学 24
G602 君永夜    | G602 苏州大学 25
G622 血狼      | G638 三江学院 27

  def commonJoin(sc: SparkContext): Unit = {
    val info1 = sc.parallelize(Array(("G601", "阿呆"), ("G602", "君永夜"), ("G622", "血狼")))

    val info2 = sc.parallelize(Array(("G601", "南京大学", "24"), ("G602", "苏州大学", "25"), ("622", "三江学院", "27")))
      .map(x => (x._1, x))   // key the 3-field record by its first field

    // both are pair RDDs again, so we can join them directly
    info1.join(info2).foreach(println)
  }

Output: compare what changed between the two versions of the code
(G602,(君永夜,(G602,苏州大学,25)))
(G601,(阿呆,(G601,南京大学,24)))

What we actually want is: G601,阿呆,南京大学
So we modify the code again,
and this gives us the result we want:

    // both are pair RDDs, so we can join them directly
    info1.join(info2)
      .map(x => {
        x._1 + "," + x._2._1 + "," + x._2._2._2
      })
      .foreach(println)

Output:
G601,阿呆,南京大学
G602,君永夜,苏州大学

Let the thread sleep for a while:

  • commonJoin(sc)
    Thread.sleep(2000000)
    sc.stop()
    Then check the UI in a browser at localhost:4041.
    Breakdown:
  1. stage 0 is the map
  2. stage 1 is the parallelize
  3. stage 2 is the join

Where broadcast is used in production:

3.1.2 Broadcast Join

  1. Once the small side has been broadcast, we no longer implement this with join
  2. Each record read from the big table is matched against the broadcast copy of the small table
package Sparkcore04

import org.apache.spark.{SparkConf, SparkContext}

object BroadCastJoin {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("AccumulatorApp").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)

    // this approach involves no shuffle, provided the small side really is small
    broadcastJoin(sc)
    Thread.sleep(200000)
    sc.stop()
  }

  def broadcastJoin(sc: SparkContext): Unit = {

    // small side ==> broadcast it
    val info1 = sc.parallelize(Array(("G601", "阿呆"), ("G602", "君永夜"), ("G622", "血狼")))
      .collectAsMap()   // a Map on the driver; look up keys with map.get

    // broadcast it from the driver
    val info1BroadCast = sc.broadcast(info1)

    // big side
    val info2 = sc.parallelize(Array(("G601", "南京大学", "24"), ("G602", "苏州大学", "25"), ("G622", "三江学院", "27"), ("G652", "中国矿业大学", "27")))
      .map(x => (x._1, x))

    info2.mapPartitions(x => {
      // grab the broadcast copy of info1
      val broadcastMap = info1BroadCast.value

      // for every info2 record whose key exists in info1, emit (key, name from info1, second field of info2)
      for ((key, value) <- x if broadcastMap.contains(key))
        yield (key, broadcastMap.get(key).getOrElse(""), value._2)
    }).foreach(println)
  }
}


3.2 Accumulators

  • Accumulators are variables that are only "added" to through an associative and commutative operation and can therefore be efficiently supported in parallel. They can be used to implement counters (as in MapReduce) or sums. Spark natively supports accumulators of numeric types, and programmers can add support for new types.

  • An accumulator only supports an "add" operation; underneath it implements a counter. Spark natively supports accumulators of numeric types, and you can also define your own.

Testing in spark-shell:

1. scala> val accum = sc.longAccumulator("John Accumulator")
accum: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 0, name: Some(John Accumulator), value: 0)

2. scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum.add(x))
...
10/09/29 18:41:08 INFO SparkContext: Tasks finished in 0.317106 s

3. scala> accum.value
res2: Long = 10

In the UI, the stage page shows the accumulator information; it is shared across the tasks, and underneath there are 4 accumulator entries, one per task.
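The docs mention that programmers can add support for new types; a minimal sketch of a custom accumulator using Spark's AccumulatorV2 API (the class name and usage here are illustrative, not part of the lesson):

import org.apache.spark.util.AccumulatorV2

// A custom accumulator that collects the distinct strings seen across tasks.
class StringSetAccumulator extends AccumulatorV2[String, Set[String]] {
  private var set = Set.empty[String]

  override def isZero: Boolean = set.isEmpty
  override def copy(): StringSetAccumulator = {
    val acc = new StringSetAccumulator
    acc.set = set
    acc
  }
  override def reset(): Unit = { set = Set.empty[String] }
  override def add(v: String): Unit = { set += v }
  override def merge(other: AccumulatorV2[String, Set[String]]): Unit = { set ++= other.value }
  override def value: Set[String] = set
}

// Usage (e.g. in spark-shell): register it, add to it inside an action, read it on the driver.
// val acc = new StringSetAccumulator
// sc.register(acc, "distinct keys")
// sc.parallelize(Seq("G601", "G602", "G601")).foreach(x => acc.add(x))
// acc.value   // Set(G601, G602)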
