Big Data in Practice, Lesson 17 (Part 1) - Spark-Core05

Chapter 1: Review of the Last Lesson

Chapter 2: map and mapPartitions

Chapter 3: A Look at the sc.textFile Source Code

Chapter 4: Spark Tuning

Chapter 1: Review of the Last Lesson

Big Data in Practice, Lesson 16 (Part 2) - Spark-Core04
https://blog.csdn.net/zhikanjiani/article/details/99731015

Chapter 2: map and mapPartitions

  • As a higher-order function, map applies a function to every element, i.e. a mapping y = f(x).

1. The definition of map in RDD.scala:

  Return a new RDD by applying a function to all elements of this RDD.
// Returns a new RDD; the function is applied to every element of this RDD

  def map[U: ClassTag](f: T => U): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
  }

2. The definition of mapPartitions in RDD.scala:


   Return a new RDD by applying a function to each partition of this RDD.
// Returns a new RDD; the function is applied to each partition of this RDD

 `preservesPartitioning` indicates whether the input function preserves the partitioner, which
 should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
  
  def mapPartitions[U: ClassTag](
      f: Iterator[T] => Iterator[U],
      preservesPartitioning: Boolean = false): RDD[U] = withScope {
    val cleanedF = sc.clean(f)
    new MapPartitionsRDD(
      this,
      (context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(iter),
      preservesPartitioning)
  }
  • mapPartitions operates on whole partitions; map operates on individual records.
    An RDD is made up of N partitions, and each partition is made up of N records.

Example:

  1. Suppose an RDD has 10 partitions with 1,000,000 records per partition, and we want to save the RDD to MySQL.
    With map: one connection per record, i.e. 10,000,000 connections in total.
    With mapPartitions: one connection per partition, i.e. only 10 connections.

The demo below simulates the map case with a fake connection pool:
package Sparkcore04

import java.util.Random

import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.mutable.ListBuffer

object mapPartition {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
    sparkConf.setAppName("MapPartitionApp").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)

    // build 100 fake student records: G601, G602, ...
    val stus = new ListBuffer[String]
    for (i <- 1 to 100) {
      stus += "G60" + i
    }

    val rdd = sc.parallelize(stus)

    // map works record by record: a "connection" is fetched for every element,
    // so 100 records mean 100 connections
    rdd.map(x => {
      val conn = DB.getConn()
      println(conn + "~~~")
      DB.returnConn(conn)
    }).collect

    sc.stop()
  }
}

// a fake connection pool: getConn() just returns a random number as the "connection"
object DB {
  def getConn() = {
    new Random().nextInt(10) + ""
  }

  def returnConn(conn: String) = {
  }
}

Using mapPartitions:

    val rdd = sc.parallelize(stus)
    println("Number of partitions: " + rdd.partitions.length)

    // mapPartitions works partition by partition: one "connection" per partition
    rdd.mapPartitions(partition => {
      val conn = DB.getConn()
      println(conn + "~~~~")
      DB.returnConn(conn)
      partition
    }).collect

    sc.stop()

Output:
Number of partitions: 2
0~~~~
5~~~~
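
In a real job the partition iterator is usually transformed as well, not just passed straight through. A minimal sketch of that shape, reusing the toy DB object from above (the per-record "write" is only simulated by string concatenation):

    rdd.mapPartitions(partition => {
      // one "connection" for the whole partition
      val conn = DB.getConn()
      // transform every record of this partition through the same connection;
      // note that Iterator.map is lazy, so in real code the connection must stay
      // usable until the iterator is fully consumed (here returnConn is a no-op)
      val result = partition.map(record => record + "_written_via_" + conn)
      DB.returnConn(conn)
      result
    }).collect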

For batch processing you need to think in batches and wrap each batch in a transaction in the code. In Java this is often handled with AOP; in big data it is usually simpler to delete first and then insert, so the job can be re-run safely.

  • When you write a SQL statement, click a button to submit it and wait for the result to come back, do you even know how many hours the job will run?

(Recall the word-count example: flatMap, then map.)

2.1 foreachPartition

The definition of foreachPartition in RDD.scala:

    Applies a function f to each partition of this RDD.
   // Applies a function to each partition of this RDD
   
  def foreachPartition(f: Iterator[T] => Unit): Unit = withScope {
    val cleanF = sc.clean(f)
    sc.runJob(this, (iter: Iterator[T]) => cleanF(iter))
  }

  • The body calls runJob, so foreachPartition is an action.
  1. Whenever you save data out to an external store (HBase, MySQL, ...), foreachPartition is the operator of choice; see the sketch after this list.
  2. map and mapPartitions are transformations; in the demos above they were triggered manually with collect.
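
A minimal sketch of that pattern with plain JDBC, assuming the RDD[String] from the earlier demo; the database URL, credentials, table and column names are made up for illustration:

    import java.sql.DriverManager

    rdd.foreachPartition(partition => {
      // one JDBC connection per partition (URL, user and password are placeholders)
      val conn = DriverManager.getConnection(
        "jdbc:mysql://hadoop002:3306/g6_test", "user", "password")
      val pstmt = conn.prepareStatement("INSERT INTO stu(name) VALUES (?)")
      partition.foreach(record => {
        pstmt.setString(1, record)
        pstmt.executeUpdate()          // a real job would batch these updates
      })
      pstmt.close()
      conn.close()
    })

Because foreachPartition is an action, no collect is needed to trigger it.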

Chapter 3: A Look at the sc.textFile Source Code

sc.textFile(" ") // turns a file or a directory into an RDD

What the underlying source code does:

Read a text file from HDFS, a local file system (available on all nodes), or any
 Hadoop-supported file system URI, and return it as an RDD of Strings.
@param path path to the text file on a supported file system
@param minPartitions suggested minimum number of partitions for the resulting RDD
@return RDD of lines of the text file
// Read a text file from HDFS, from a local file system (which must be accessible on every node), or from any Hadoop-supported file system URI (such as S3), and return it as an RDD of Strings

  def textFile(
      path: String,
      minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
    assertNotStopped()
    hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
      minPartitions).map(pair => pair._2.toString).setName(path)
  }

Key points:

  1. hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], minPartitions)
  • This part reads the file: it produces (offset, line) key-value pairs, just like MapReduce's TextInputFormat.
  2. .map(pair => pair._2.toString).setName(path)
  • The key step: map(pair => pair._2.toString) keeps only the value (the line of text) and drops the offset key; see the sketch after this list.
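
A small sketch of what that expansion looks like when written out by hand with sc.hadoopFile (only the HDFS path is a placeholder):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.TextInputFormat

    // roughly what sc.textFile("hdfs://hadoop002:9000/data/input.txt") does under the hood
    val pairs = sc.hadoopFile("hdfs://hadoop002:9000/data/input.txt",
      classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
    // pairs is an RDD[(LongWritable, Text)]: (byte offset, line of text)
    val lines = pairs.map(pair => pair._2.toString)   // keep the line, drop the offset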

textFile calls hadoopFile:

/** Get an RDD for a Hadoop file with an arbitrary InputFormat
   *
   * @note Because Hadoop's RecordReader class re-uses the same Writable object for each
   * record, directly caching the returned RDD or directly passing it to an aggregation or shuffle
   * operation will create many references to the same object.
   * If you plan to directly cache, sort, or aggregate Hadoop writable objects, you should first
   * copy them using a `map` function.
   * @param path directory to the input data files, the path can be comma separated paths
   * as a list of inputs
   * @param inputFormatClass storage format of the data to be read
   * @param keyClass `Class` of the key associated with the `inputFormatClass` parameter
   * @param valueClass `Class` of the value associated with the `inputFormatClass` parameter
   * @param minPartitions suggested minimum number of partitions for the resulting RDD
   * @return RDD of tuples of key and corresponding value
   */
  def hadoopFile[K, V](
      path: String,
      inputFormatClass: Class[_ <: InputFormat[K, V]],
      keyClass: Class[K],
      valueClass: Class[V],
      minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = withScope {
    assertNotStopped()

    // This is a hack to enforce loading hdfs-site.xml.
    // See SPARK-11227 for details.
    FileSystem.getLocal(hadoopConfiguration)

    // A Hadoop configuration can be about 10 KB, which is pretty big, so broadcast it.
    val confBroadcast = broadcast(new SerializableConfiguration(hadoopConfiguration))
    val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
    new HadoopRDD(
      this,
      confBroadcast,
      Some(setInputPathsFunc),
      inputFormatClass,
      keyClass,
      valueClass,
      minPartitions).setName(path)
  }
  • Analysis: it news up a HadoopRDD.
  • Note that the classes this code relies on (TextInputFormat, LongWritable, Text) are MapReduce's own input-reading classes.

Interview question: Mapper<> and Reducer<> carry 4 generic type parameters

  • public interface Mapper<K1, V1, K2, V2> extends JobConfigurable, Closeable
    whereas the corresponding map and reduce methods take 3 parameters.

3.1 Understanding the spark-shell Startup Flow

  1. cd $SPARK_HOME/bin; more spark-shell
  • cygwin=false: some people want to set up and learn big data on Windows, so the script first checks the operating environment.
  2. case "$(uname)" in
       CYGWIN*) cygwin=true;;
  • uname prints system information; uname -r shows the OS release number; uname -a prints everything.


  3. if [ -z "${SPARK_HOME}" ]; then            # check whether SPARK_HOME is empty
       source "$(dirname "$0")"/find-spark-home  # find-spark-home is a script in the same directory as spark-shell
     fi
  • If SPARK_HOME is not set, the next statement sources find-spark-home to locate and export it.

Test it in the shell:

1. vi test.sh
teacher="ruoze"
if [ -z "${teacher}" ]; then
  echo "jepson"
else
  echo ${teacher}
fi

2. chmod +x test.sh

3. ./test.sh        # prints ruoze

What this does: teacher is set to "ruoze"; the script checks whether the variable is empty. If it is empty it prints jepson, otherwise it prints the value of teacher. If you comment the assignment out (#teacher=...), the output becomes jepson.

What source "$(dirname "$0")"/find-spark-home means:

1. vi test.sh
home=`cd $(dirname "$0"); pwd`
echo ${home}

2. chmod +x test.sh

3. ./test.sh
It prints the directory that the current script lives in.
  4. The script then defines a main function and calls it with all the arguments:

    function main() {

    }
    main "$@"

Testing what main "$@" does:

1. function main() {
       echo "input params is: "$@
   }
   main "$@"

2. ./test.sh xx yy zz

3. Output: input params is: xx yy zz

  • In the else branch (the non-Cygwin case) main runs:
    export SPARK_SUBMIT_OPTS
    "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$@"

Summary: under the hood spark-shell calls spark-submit. The application is submitted under the name "Spark shell", and every argument you passed to spark-shell (for example --master) is forwarded along. Control now moves to spark-submit.

  5. spark-submit in turn ends with:
    exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"
  • exec replaces the current shell process with that command.

For comparison, look at how Hadoop's own scripts chain into one another:

cd $HADOOP_HOME
cd sbin
cat start-all.sh

  • The Spark call chain is:
    spark-shell
    spark-submit
    spark-class

Chapter 4: Tuning Spark

  • First, understand which parts of Spark can actually be tuned.
  1. Because of the in-memory nature of most Spark computations, Spark programs can be bottlenecked by any resource in the cluster:CPU,network bandwidth,or memory. Most often, if the data fits in memory,the bottleneck is network bandwidth, but sometimes, you also need to do some tuning, such as storing RDDs in serialized form, to decrease memory usage. This guide will cover two main topics: data serialization, which is crucial for good network performance and can also reduce memory use, and memory tuning. We also sketch several smaller topics.
  • In short: because most Spark computation is in-memory, any resource in the cluster (CPU, network bandwidth or memory) can become the bottleneck. If the data fits in memory, the bottleneck is usually network bandwidth, but sometimes you still need to tune, for example by storing RDDs in serialized form to reduce memory usage.
  • The guide covers two main topics: data serialization, which is crucial for good network performance and also reduces memory use, and memory tuning.

If every resource in the cluster is already maxed out, no amount of Spark tuning will save you; CPU + memory = resources, and network bandwidth these days is usually 10-gigabit anyway.

4.1 Data Serialization

  1. Serialization plays an important role in the performance of any distributed application. Formats that are slow to serialize objects into, or consume a large number of bytes, will greatly slow down the computation. Often, this will be the first thing you should tune to optimize a Spark application. Spark aims to strike a balance between convenience (allowing you to work with any Java type in your operations) and performance. It provides two serialization libraries.
  • Serialization plays an important role in the performance of any distributed application: a format that is slow to serialize objects, or that produces a large number of bytes, will drag down the whole computation.
  • This is usually the first thing to tune in a Spark application. Spark aims to strike a balance between convenience (you can use any Java type in your operations) and performance, and it provides two serialization libraries.
  1. Java serialization:By default, Spark serializes objects using Java’s ObjectOutputStream framework, and can work with any class you create that implements java.io.Serializable. You can also control the performance of your serialization more closely by extending java.io.Externalizable. Java serialization is flexible but often quite slow, and leads to large serialized formats for many classes
  • Java serialization: by default, Spark serializes objects using Java's ObjectOutputStream and works with any class you create that implements java.io.Serializable.

For Java serialization the verdict is slow and large, which is what leads to Kryo.

  1. Kryo serialization:Spark can also use the Kryo library (version 4) to serialize objects more quickly.
    Kryo is significantly faster and more compact than Java serialization (often as much as 10X), but does not support all Serializable types and requires you to register the classes you will use in the program in advance for best performance.
  • Kryo is faster and more compact (smaller). For best performance you must register your classes: registered classes are both fast and compact. Without registration Kryo still works, just somewhat slower and with a larger footprint.

4.2 How to Enable the Serializer

  • Does it have to be set in code? No. Go into the config directory: cd $SPARK_HOME/conf, then vi spark-defaults.conf and enable the parameter that ships commented out: #spark.serializer.
    The file is a plain key-value structure: each line is a key followed by its value (the hdfs://hadoop002:9000/g6_directory value already in the file belongs to another key and simply shows the same layout).

  • If you'd rather not configure it in the conf directory, pass it on the command line: ./spark-submit --conf key=value (for example --conf spark.serializer=org.apache.spark.serializer.KryoSerializer). Options given on the command line take precedence over those in the conf directory.

How to register classes?

val sparkConf = new SparkConf()
sparkConf.setAppName("MapPartitionApp").setMaster("local[2]")
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")   // choose the serializer
sparkConf.registerKryoClasses(Array(classOf[info]))                               // register the classes you use (info is your own class)

Try it yourself: take one dataset, turn it into an RDD and cache it (rdd.cache() defaults to MEMORY_ONLY), then compare the storage footprint across the four setups below, as sketched after the list:

  • data => rdd.cache() with the default MEMORY_ONLY
  • data => rdd.persist(MEMORY_ONLY_SER)
  • data => spark.serializer set to Kryo + rdd.persist(MEMORY_ONLY_SER)
  • data => spark.serializer set to Kryo + registered classes + rdd.persist(MEMORY_ONLY_SER)
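
A minimal sketch of that experiment, assuming a placeholder case class Info and an input path of your own; run it once per configuration and compare the "Size in Memory" column on the Spark UI's Storage tab:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    case class Info(id: Int, name: String)     // placeholder class to register

    val sparkConf = new SparkConf().setAppName("SerCompareApp").setMaster("local[2]")
    // run 1: comment out the next two lines; run 2: keep only the serializer; run 3: keep both
    sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    sparkConf.registerKryoClasses(Array(classOf[Info]))
    val sc = new SparkContext(sparkConf)

    val rdd = sc.textFile("hdfs://hadoop002:9000/data/input.txt")   // placeholder path
      .map(line => Info(line.hashCode, line))

    rdd.persist(StorageLevel.MEMORY_ONLY_SER)   // use MEMORY_ONLY for the baseline run
    rdd.count()                                 // trigger the cache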
