Big Data in Practice, Lesson 14: SparkCore02

1. Review of the Previous Lesson

2. Why Choose Spark

3. RDD Creation, Basic Operations, and In-Depth Explanation

4. Spark's Runtime Architecture

1. Review of the Previous Lesson

Big Data in Practice, Lesson 12: SparkCore01:

  • https://blog.csdn.net/zhikanjiani/article/details/99560528

2. Why Choose Spark

  1. Fast: roughly 10x faster on disk and 100x faster in memory (compared with MapReduce)
  2. Easy: easier to code, with an interactive shell
  3. Unified stack: Batch, Streaming, ML, Graph
  4. Deployment: Local, Standalone, YARN, K8s
  5. Multi-language support: Scala, Java, Python, R

3. RDD Creation, Basic Operations, and In-Depth Explanation

RDD (Resilient Distributed Dataset)

See the following blog posts:

  • RDD creation: https://blog.csdn.net/zhikanjiani/article/details/90613976
  • RDD basic operations (1): https://blog.csdn.net/zhikanjiani/article/details/97833470
  • RDD basic operations (2): https://blog.csdn.net/zhikanjiani/article/details/97902220
  • RDD in depth: https://blog.csdn.net/zhikanjiani/article/details/90575957

Ways to create an RDD (see the sketch below):
textFile: local files, or anything HDFS-compatible (e.g. S3)
parallelize: this method is best suited for testing
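
A minimal sketch of both creation paths, assuming an existing SparkContext named sc (e.g. in spark-shell); the file paths are placeholders, not from the original post:

val fromHdfs  = sc.textFile("hdfs:///data/access.log")          // HDFS, S3, or any HDFS-compatible store
val fromLocal = sc.textFile("file:///tmp/access.log")           // local file system
val forTests  = sc.parallelize(List(1, 2, 3, 4), numSlices = 2) // in-memory collection, handy for quick tests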

Transformations:
Characteristics: lazy (deferred execution); writing a chain of transformations does not run anything immediately.

Actions (operators):
eager
return a value to the driver

Typical action operators (see the sketch below):
collect, reduce, count, take
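
A small sketch of the lazy/eager distinction, again assuming an existing SparkContext sc (none of this is from the original post):

// Transformations are lazy: this line only builds the lineage, nothing runs yet.
val doubled = sc.parallelize(1 to 10).map(_ * 2)

// Actions are eager and return a value to the driver: each call below triggers a job.
doubled.count()       // 10
doubled.take(3)       // Array(2, 4, 6)
doubled.reduce(_ + _) // 110
doubled.collect()     // brings every element back to the driver; be careful with large RDDs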

3.1 A Typical Error When Developing in IDEA

The code is as follows:

package SparkCore01

import org.apache.spark.{SparkConf, SparkContext}

object LogApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
    sparkConf.setAppName("LogApp").setMaster("local[2]")

    val sc = new SparkContext()   // note: sparkConf is NOT passed in; this is the mistake
    sc.parallelize(List())

    sc.stop()
  }

}

Running it produces:
19/08/14 16:37:29 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: A master URL must be set in your configuration

Looking at the SparkContext source:
def this() = this(new SparkConf()) // when no SparkConf is passed in, a new one is created, so the sparkConf we defined earlier is never used
Fix:
val sc = new SparkContext(sparkConf) // now the conf is passed in; the primary constructor is: class SparkContext(config: SparkConf) extends Logging {

3.2 Summing the Traffic per Domain

Data source: logs produced by a Python log generator

Requirement analysis:

  • Only two fields are needed: domain (field 11) and traffic (field 20)
  • Implementation: group by domain, then sum the traffic within each group
  • Summation: use the reduceByKey operator
  • Essentially this is a word count
package com.ruozedata.bigdata

import org.apache.spark.{SparkConf, SparkContext}

object LogApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
    sparkConf.setAppName("LogApp").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)

    val lines = sc.textFile("file:///C:/Users/Administrator/Desktop/baidu.log")
    lines.take(4).foreach(println)

    // TODO: sum the traffic per domain
    // Implementation: group by domain, then sum the traffic within each group (a word count)
    lines.map(x => {
      val splits = x.split("\t")
      val domain = splits(10)         // indices start from zero, so field 11 is splits(10)
      val traffic = splits(19).toLong // field 20 is splits(19)
      (domain, traffic)
    }).reduceByKey(_ + _).collect.foreach(println)

    sc.stop()
  }
}

Summary:
Never assume the logs you process are well-formed. Even though the meaning of each field and the delimiter are defined up front, there is no guarantee that the logs collected by the frontend or written by Nginx are actually correct.
Records like that are dirty data. How do we deal with them?

Solution:

For robustness, the code must check the number of fields in each line. The revised version is below.

First revision:

package SparkCore01

import org.apache.spark.{SparkConf, SparkContext}

object LogApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
    sparkConf.setAppName("LogApp").setMaster("local[2]")

    val sc = new SparkContext(sparkConf)

    val lines = sc.textFile("file:///C:/Users/Administrator/Desktop/a.log")

    // TODO: sum the traffic per domain
    /*
     * Requirement analysis: only the domain and traffic fields are needed
     * Implementation: group by domain, then sum the traffic within each group
     * Summation: use reduceByKey
     * It is just a word count
     */

    // 1. Read each line and extract only the domain and traffic fields
    lines.map(x => {
      val splits = x.split("\t")

      val length = splits.length
      if (length == 5) {
        val domain = splits(0)
        val traffic = splits(3).toLong // toLong is used here
        (domain, traffic)
      } else {
        ("-", 0L)                      // so the default here is 0L, keeping the value type consistent
      }
    }).reduceByKey(_ + _).collect.foreach(println)
    sc.stop()
  }

}

Is there still a risk after the first revision? If the 19th field is a non-numeric string, it cannot be converted with toLong, so wrap the conversion in a try/catch:

package com.ruozedata.bigdata

import org.apache.spark.{SparkConf, SparkContext}

object LogApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
    sparkConf.setAppName("LogApp").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)

    val lines = sc.textFile("file:///C:/Users/Administrator/Desktop/baidu.log")
//    lines.take(10).foreach(println)

    lines.map(x => {
      val splits = x.split("\t")
      val length = splits.length
      var traffic = 0L

      if (length == 72) {
        val domain = splits(11)
        try {
          traffic = splits(3).toLong        // if this field is malformed, control jumps to the catch
        } catch {
          case e: Exception => traffic = 0L // treat the dirty value as 0
        }
        (domain, traffic)
      } else {
        ("-", 0L)
      }
    }).reduceByKey(_ + _).collect.foreach(println)

    sc.stop()
  }

}

Note that because traffic is initialized to 0L, even if the catch block did not assign it (case e: Exception => traffic = 0L), traffic would still end up as 0.
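
As an aside (not from the original lesson), the same defensive parsing can be written more compactly with scala.util.Try, which avoids the var and the try/catch altogether; this sketch reuses the splits array from the snippet above:

import scala.util.Try

// Try(...).getOrElse(0L) yields 0L for any malformed value instead of throwing.
val traffic = Try(splits(3).toLong).getOrElse(0L)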

3.3 Top 10 Provinces by Access Count

Requirement analysis: in essence, bring in the Chunzhen (纯真) IP database and write code to resolve IPs to provinces.

How do we use the Chunzhen IP database for resolution?
https://blog.csdn.net/adayan_2015/article/details/88580988

  1. Download the project from the corresponding GitHub repository: https://github.com/wzhe06/ipdatabase
  2. After downloading it, open a CMD console and build it with Maven; note that the command must be run from the project root: mvn clean package -DskipTests=true
  3. Install the jar into the local Maven repository with the following Maven command:

mvn install:install-file -Dfile=C:\Users\Administrator\Desktop\ipdatabase-master\ipdatabase-master\target\ipdatabase-1.0-SNAPSHOT.jar -DgroupId=com.ggstar -DartifactId=ipdatabase -Dversion=1.0 -Dpackaging=jar

Explanation of the options:

  • -Dfile: the full path to the packaged jar
  • -DgroupId: the group (package) name
  • -DartifactId: the artifact (project) name
  • -Dversion: the version

How to verify success:

A successful install message in the CMD console means the jar has been installed into the local Maven repository.

4. You can now add the dependency to your Spark project, and it will resolve without errors:

<dependency>
  <groupId>com.ggstar</groupId>
  <artifactId>ipdatabase</artifactId>
  <version>1.0</version>
</dependency>

<dependency>
  <groupId>org.apache.poi</groupId>
  <artifactId>poi-ooxml</artifactId>
  <version>3.14</version>
</dependency>

5. Create a resources directory in the project and copy the two files "ipDatabase.csv" and "ipRegion.xlsx" from the downloaded source into it.
6. Configure the directory via File --> Project Structure --> Modules.
7. Write test code:

Get the province from an IP address:

package SparkCore01

import com.ggstar.util.ip.IpHelper

object test {
  // resolve an IP address to its region via the ipdatabase library
  def getCity(ip: String) = {
    IpHelper.findRegionByIp(ip)
  }

  def main(args: Array[String]): Unit = {
    println(getCity("112.80.63.242"))
  }
}

Output: 江苏省 (Jiangsu Province)

The library resolves down to province, city and carrier (e.g. 江苏省 / 苏州市 / China Unicom). Whether foreign IPs can be resolved is another matter; some addresses only resolve to local LAN entries.
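
If unresolved addresses are a concern, a defensive wrapper could look like the sketch below. This is purely illustrative and assumes (without verification) that IpHelper.findRegionByIp may throw or return null for addresses it does not know:

import com.ggstar.util.ip.IpHelper
import scala.util.Try

// Hypothetical helper: fall back to "-" when an IP cannot be resolved.
def getRegionOrDefault(ip: String): String =
  Try(IpHelper.findRegionByIp(ip)).toOption.filter(_ != null).getOrElse("-")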

Requirement analysis:

  • Map each record to (province, 1), then reduceByKey(_ + _)
  • An IP resolution library is needed: Chunzhen (纯真) or Taobao's IP library
  • Search online for the details of IP library resolution

The code is as follows:

One line of core logic needs many other lines of code to keep it safe.
lines.map(x => {
  val splits = x.split("\t")
  val length = splits.length
  if (length == 73) {
    val args6 = splits(6)          // ip:port, e.g. 192.168.137.252:8080
    val serverPort = args6.split(":")
    val ip = if (serverPort.length == 2) serverPort(0) else args6
    val province = IPUtil.getInstance().getIpInfos(ip)(1) // IPUtil is not shown in the post; element 1 of getIpInfos(ip) is used as the province
    (province, 1)
  } else {
    ("-", 1)
  }
}).reduceByKey(_ + _).sortBy(_._2, ascending = false).take(10).foreach(println)

3.4 The Purpose of LogApp

  1. Understand how Spark handles big-data business processing
  2. Master the handling of boundary values in production ==> ensure the robustness of the code

4. Spark's Runtime Architecture

  • Spark's runtime architecture *****
  • Importance: five stars. http://spark.apache.org/docs/latest/cluster-overview.html

The following table summarizes the terms you will see used to refer to cluster concepts:

Term: Meaning

  1. Application: User program built on Spark. Consists of a driver program and executors on the cluster.
  2. Application jar: A jar containing the user's Spark application. In some cases users will want to create an "uber jar" containing their application along with its dependencies. The user's jar should never include Hadoop or Spark libraries; however, these will be added at runtime.
  3. Driver program: The process running the main() function of the application and creating the SparkContext.
  4. Cluster manager: An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN).
  5. Deploy mode: Distinguishes where the driver process runs. In "cluster" mode, the framework launches the driver inside of the cluster. In "client" mode, the submitter launches the driver outside of the cluster.
  6. Worker node: Any node that can run application code in the cluster.
  7. Executor: A process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
  8. Task: A unit of work that will be sent to one executor.
  9. Job: A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs.
  10. Stage: Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs.
  1. Application: a user program built on Spark.
  • Spark Application = a driver program + executors on the cluster.
  2. Application jar:
  • The jar produced by packaging the Maven project in IDEA; that jar is the application jar.
  3. Driver program:
  • The process that runs the application's main method and creates a SparkContext.
  4. Cluster manager:
  • An external service for acquiring resources on the cluster.
  • ==> Resources on the cluster are requested through the cluster manager (e.g. standalone manager, Mesos, YARN).
  5. Deploy mode:
  • Distinguishes where the driver process runs **
  • client: the driver runs locally
  • cluster: the driver runs inside the cluster
  6. Worker node: any node that can run application code in the cluster.
  • For YARN, a worker node is a NodeManager running a container.
  7. Executor:
  • A process launched for an application on a worker node,
  • that runs tasks and keeps data in memory or on disk across them (each executor can run multiple tasks).
  • Each application has its own executors: even when two applications have containers running on the same NodeManager, their executors are completely independent of each other.

A digression on the MapReduce execution flow:

A container is found to launch the ApplicationMaster; the AM requests resources from the ResourceManager, and the MapTasks and ReduceTasks run on NodeManagers.

  8. Task: a unit of work that will be sent to one executor.
  • Each executor can run multiple tasks.
  9. Job:
  • A parallel computation consisting of multiple tasks that gets spawned
  • in response to a Spark action (e.g. save, collect). Every action encountered in Spark produces a job.
  10. Stage: each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you will see this term used in the driver's logs.

4.1 Summary

What is an application?

  1. An Application consists of one driver and multiple executors;
  2. The driver runs the application's main method and creates the SparkContext inside it;
  3. Cluster manager: a service used to acquire resources;
  4. Deploy mode: distinguishes whether the driver runs locally or on the cluster;
  5. Worker node: runs our executor processes; for YARN this means running a container on a NodeManager;
  6. Executor: a process that runs our tasks (there are many of them) and can hold data in memory or on disk;
  7. Job: each action creates a job; a job contains many tasks; a task is the smallest unit of execution and is sent to an executor to run. The sketch after this list ties these terms together.
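
A minimal sketch (not from the original lesson, assuming local[2] and the word-count pattern from section 3.2) showing how these terms map onto code: one action spawns one job, the shuffle introduced by reduceByKey splits that job into two stages, and each stage runs as one task per partition on the executors.

import org.apache.spark.{SparkConf, SparkContext}

object JobStageTaskDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("JobStageTaskDemo").setMaster("local[2]"))

    // 2 partitions => 2 tasks per stage
    val words = sc.parallelize(Seq("a", "b", "a", "c"), numSlices = 2)

    // transformations only: no job is submitted yet
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

    // the action: spawns 1 job with 2 stages (visible in the driver logs and the web UI, typically at localhost:4040)
    counts.collect().foreach(println)

    sc.stop()
  }
}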

4.2 Cluster Mode Overview

Components:

  1. Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).
  • A Spark application runs as a group of independent processes on the cluster, coordinated by the SparkContext object in the main program.
  2. Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark's own standalone cluster manager, Mesos or YARN), which allocate resources across applications ("which" refers to the cluster managers);

  3. Once connected, Spark acquires executors on nodes in the cluster, which (refers to the executors) are processes that run computations and store data for your application;

  4. Next, it (refers to the driver program) sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.

There are several useful things to note about this architecture:

1. Each application has its own independent processes

  1. Each application gets its own executor processes, which (refers to the executors) stay up for the duration of the whole application and run tasks in multiple threads.

  2. This has the benefit of isolating applications from each other, on both the scheduling side (each driver schedules its own tasks) and the executor side (tasks from different applications run in different JVMs).

  3. However, it also means that data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system (shared data must be written out to external storage).
    Extension: the Alluxio framework (a distributed in-memory storage framework).

2. Spark is agnostic to the underlying cluster manager

  1. Spark is agnostic to (does not care about) the underlying cluster manager.
  • The code is the same; it does not matter whether it runs on YARN or standalone.
  2. As long as it can acquire executor processes, and these communicate with each other,
  • Once Spark has obtained its executor processes, those processes communicate with each other.
  3. it is relatively easy to run it even on a cluster manager that also supports other applications (e.g. Mesos/YARN).
  • It is relatively easy to run Spark even on a cluster manager that also supports other applications, such as Mesos or YARN.

3. The network between the driver and the executors must be open
1. The driver program must listen for and accept incoming connections from its executors throughout its lifetime (e.g. see spark.driver.port in the network config section); as such, the driver program must be network addressable from the worker nodes.

  • The driver program must listen for and accept incoming connections from its executors, so the network between them must be open.

4. The driver should schedule jobs as close to the worker nodes (NodeManagers) as possible

  1. Because the driver schedules tasks on the cluster, it (the driver) should be run close to the worker nodes, preferably on the same local area network.
  • Since the driver schedules the jobs running on the cluster, it should run as close to the worker nodes as possible, preferably on the same local area network.
  2. If you would like to send requests to the cluster remotely, it is better to open an RPC (Remote Procedure Call) to the driver and have it submit operations from nearby than to run a driver far away from the worker nodes.
  • If you want to send requests to a remote cluster, it is better to do so via an RPC to a driver running near the cluster.

Summary:

1. Each application has its own independent processes
2. Spark is agnostic to the underlying cluster manager
3. The network between the driver program and the executors must be open
4. The driver should schedule jobs as close to the worker nodes (NodeManagers) as possible
