SparkSQL04

HBinz

Tags: BigData

Article link: https://blog.csdn.net/Binbinhb/article/details/88607854

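I. Custom spark.read.format

A single project often has to read several file formats and turn each one into a DataFrame; this comes up frequently in ETL work.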
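
As one possible illustration of handling several formats in one job, the sketch below picks a built-in data source name from the file extension and passes it to spark.read.format. The object name FormatDispatch, the helpers formatFor and readAny, and the extension mapping are assumptions made for this example, not part of the original project. A truly custom format would instead implement Spark's data source interfaces.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: the names and the extension mapping are illustrative only.
object FormatDispatch {
  // Map a file extension to the name of a built-in Spark data source.
  def formatFor(path: String): String =
    path.toLowerCase.split('.').last match {
      case "json"    => "json"
      case "csv"     => "csv"
      case "parquet" => "parquet"
      case "orc"     => "orc"
      case _         => "text" // fall back to reading plain text lines
    }

  // Read any supported file into a DataFrame through spark.read.format.
  def readAny(spark: SparkSession, path: String): DataFrame =
    spark.read.format(formatFor(path)).load(path)
}
```

Calling FormatDispatch.readAny(spark, "file:///opt/data/sales.csv") would then go through the csv source, and format-specific options (such as header) could be added per case.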

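II. Functions: Spark built-in functions

PV/UV example:

Requirement: count how many videos each user watched on each day.

1) Convert the array to an RDD

2) Convert the RDD to a DataFrame/Dataset

// API approach:

3) Group by date and user; the SQL equivalent is: select user, date, count(1) from xxx group by user, date

1. Source code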

package com.HBinz.spark.Spark.sql.day06

import org.apache.spark.sql.SparkSession

object PvUvApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("PvUvApp")
      .getOrCreate()
    // Build sample data
    val log = Array(
      "2018-02-13,G301",
      "2018-02-13,G302",
      "2018-02-13,G303",
      "2018-02-13,G301",
      "2018-02-13,G301",
      "2018-02-13,G301",
      "2018-02-13,G301",
      "2018-02-13,G301",
      "2018-02-13,G301",
      "2018-02-13,G302",
      "2018-02-13,G303",
      "2018-02-13,G303",
      "2018-02-13,G302",
      "2018-02-13,G301",
      "2018-02-13,G301"
    )
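    // Count the videos each user watched on each day
    /*
    1) Convert the array to an RDD
    2) Convert the RDD to a DataFrame/Dataset
    // API approach:
    3) Group by date and user; SQL equivalent: select user, date, count(1) from xxx group by user, date
     */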
    import spark.implicits._
    val rdd = spark.sparkContext.parallelize(log)
    val df = rdd.map(_.split(",")).map(x=>Log(x(0),x(1))).toDF
    //df.show(false)
    import org.apache.spark.sql.functions._
    df.groupBy("date","user").agg(count("user").as("PV"))
      // Sort by PV in descending order
      .sort($"PV".desc)
      .select("user","date","PV").show()

    spark.stop()
  }
case class Log(date:String,user:String)
}
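
The SQL statement mentioned in step 3 can also be run directly against a temporary view. The following is a minimal sketch under stated assumptions: the object name PvSqlSketch, the case class LogRow, and the view name logs are made up for this example. The column names are wrapped in backticks as a precaution, since user and date can clash with SQL keywords in some configurations.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical names: PvSqlSketch, LogRow and the view "logs" are illustrative.
object PvSqlSketch {
  case class LogRow(date: String, user: String)

  // Count views per (date, user) with a SQL query over a temporary view.
  def pvByUser(spark: SparkSession, lines: Seq[String]): DataFrame = {
    import spark.implicits._
    val df = lines.map(_.split(",")).map(x => LogRow(x(0), x(1))).toDF()
    df.createOrReplaceTempView("logs")
    spark.sql(
      """SELECT `user`, `date`, COUNT(1) AS PV
        |FROM logs
        |GROUP BY `user`, `date`
        |ORDER BY PV DESC""".stripMargin)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("PvSqlSketch")
      .getOrCreate()
    pvByUser(spark, Seq("2018-02-13,G301", "2018-02-13,G302", "2018-02-13,G301")).show()
    spark.stop()
  }
}
```

Both versions express the same grouping; the SQL form is convenient when the query text comes from configuration or from analysts.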

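2. The .groupBy method

It groups the rows first; aggregations are then computed per group.

3. Import org.apache.spark.sql.functions._

This import brings in the built-in column functions used inside agg, such as count and sum.

4. Web UI

Where does the 200 shown in the UI come from?

It is the default number of partitions used when shuffling data (spark.sql.shuffle.partitions defaults to 200).

5. Tuning spark.sql.shuffle.partitions

The value of spark.sql.shuffle.partitions should be chosen case by case: use different partition counts for different data sizes.

(1) Setting it from the shell

Spark configuration properties are set as key=value pairs.

(2) --conf spark.sql.shuffle.partitions=10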

./spark-shell --master local[2] --jars /opt/lib/mysql-connector-java-5.1.47.jar --conf spark.sql.shuffle.partitions=10

Check the web UI:

The setting took effect.
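
The same property can also be set in code rather than on the command line. A minimal sketch, assuming a local session; the object name ShufflePartitionsSketch and the values 10 and 20 are illustrative. Note that in recent Spark versions adaptive query execution may coalesce shuffle partitions at runtime, so the task count seen in the UI can be lower than the configured value.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical object name; the partition counts are example values only.
object ShufflePartitionsSketch {
  // Build a session with the shuffle partition count set at creation time.
  def buildSession(partitions: Int): SparkSession =
    SparkSession.builder()
      .master("local[2]")
      .appName("ShufflePartitionsSketch")
      .config("spark.sql.shuffle.partitions", partitions.toString)
      .getOrCreate()

  def main(args: Array[String]): Unit = {
    val spark = buildSession(10)
    println(spark.conf.get("spark.sql.shuffle.partitions"))
    // It is a runtime SQL setting, so it can also be changed later in the session.
    spark.conf.set("spark.sql.shuffle.partitions", "20")
    println(spark.conf.get("spark.sql.shuffle.partitions"))
    spark.stop()
  }
}
```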

6. UV

// UV is the number of distinct users per day, so group by date only
df.groupBy("date").agg(countDistinct("user").as("UV"))
  .select("date","UV").show()

III. User-defined functions

val likes = spark.sparkContext.textFile("file:///opt/data/hobbies.txt")
val likeDF = likes.map(_.split("\t")).map(x=>Like(x(0),x(1))).toDF()
// Register a temporary view
likeDF.createOrReplaceTempView("hobbies")
/*
User-defined functions:
1) Define the function
2) Use the function
 */
// likes_num splits the field value on "," and returns the number of items
spark.udf.register("likes_num",(x:String)=>x.split(",").size)
// Query the temporary view using the function
spark.sql("select name,like,likes_num(like) from hobbies").show(false)
case class Like(name:String,like:String)
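
A registered UDF is only visible to SQL text. For the DataFrame API, the same logic can be wrapped with functions.udf, as in the hedged sketch below. The object name LikesUdfSketch, the helper withLikesNum, and the sample rows are assumptions for illustration; the function also assumes the like column is never null.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical names; the sample rows stand in for hobbies.txt.
object LikesUdfSketch {
  // Same logic as likes_num: count the comma-separated entries (non-null input assumed).
  val likesNum = udf((s: String) => s.split(",").length)

  // Add a likes_num column without going through SQL text.
  def withLikesNum(df: DataFrame): DataFrame =
    df.withColumn("likes_num", likesNum(col("like")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("LikesUdfSketch")
      .getOrCreate()
    import spark.implicits._
    val df = Seq(("alice", "jogging,coding"), ("bob", "reading")).toDF("name", "like")
    withLikesNum(df).show(false)
    spark.stop()
  }
}
```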

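Summary:

1. Too many partitions produce a large number of small files.

2. Too few partitions cause performance problems, such as low parallelism and oversized tasks.

3. The Spark SQL vision:

(1) Write less code: many methods are built in, and reading external data sources is easy.

Take JSON as an example: even when field types differ from row to row, Spark SQL infers the schema automatically under the hood, so no hand-written detection code is needed.

(2) Read less data

Column pruning, filtering, caching, and predicate pushdown (PPD).

Practice requirement: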

case class Person(name:String, age:Int, salary:Double)

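Find the people whose salary is greater than 30000; only name is needed in the result, not age or salary.

Method 1 (RDD API, lowest performance):

sc.textFile(path).map { x =>
  val Array(name, age, salary) = x.split(",")
  // split yields Strings, so convert them to the field types of Person
  Person(name, age.trim.toInt, salary.trim.toDouble)
}.map {
  case Person(name, _, salary) => (name, salary)
}.filter(_._2 > 30000).map(_._1).collect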

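Method 2 (Spark SQL):

The query as written filters outside the subquery:

select name
from (
  select name, salary from xxx
) t
where t.salary > 30000;

The optimizer pushes the filter down into the subquery, so it effectively runs as:

select name
from (
  select name, salary from xxx where salary > 30000
) t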

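Summary:

This automatic optimization comes with Spark SQL, not with Spark Core. With Core, performance depends heavily on the developer's fundamentals.

Writing the job with the RDD API is a test of those fundamentals.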
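
To see what the optimizer does with such a query, the requirement can be written with the DataFrame API and inspected with explain. This is a minimal sketch with made-up sample rows; the object name PrunePushdownSketch and the helper highEarners are assumptions for this example. With in-memory data the plan only shows the reordered filter and projection; pushdown into the data source itself is most visible with columnar file sources such as Parquet.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical names; the rows below are invented sample data.
object PrunePushdownSketch {
  case class Person(name: String, age: Int, salary: Double)

  // The filter is written after the projection; the optimizer is free to reorder it.
  def highEarners(people: DataFrame): DataFrame =
    people.select("name", "salary").where("salary > 30000").select("name")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("PrunePushdownSketch")
      .getOrCreate()
    import spark.implicits._
    val people = Seq(Person("a", 30, 50000.0), Person("b", 25, 20000.0)).toDF()
    val result = highEarners(people)
    result.explain(true) // prints the parsed, analyzed, optimized and physical plans
    result.show()
    spark.stop()
  }
}
```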

(3) Leave simple optimizations to the engine

(4) Partition Discovery

Table partitioning is a common optimization approach used in systems like Hive. In a partitioned table, data are usually stored in different directories, with partitioning column values encoded in the path of each partition directory. All built-in file sources (including Text/CSV/JSON/ORC/Parquet) are able to discover and infer partitioning information automatically. For example, we can store all our previously used population data into a partitioned table using the following directory structure, with two extra columns, gender and country as partitioning columns:

path/to/table/gender=male/country=US/data.parquet
path/to/table/gender=male/country=CN/data.parquet
path/to/table/gender=female/country=US/data.parquet
path/to/table/gender=female/country=CN/data.parquet

By passing path/to/table to either SparkSession.read.parquet or SparkSession.read.load, Spark SQL will automatically extract the partitioning information from the paths. Now the schema of the returned DataFrame becomes:

root
|-- name: string (nullable = true)
|-- age: long (nullable = true)
|-- gender: string (nullable = true)
|-- country: string (nullable = true)

Note: the partitioning column types that can be inferred automatically are numeric data types, date, timestamp, and string.

Starting from Spark 1.6.0, partition discovery only finds partitions under the given paths by default. For the above example, if users pass path/to/table/gender=male to either SparkSession.read.parquet or SparkSession.read.load, gender will not be considered as a partitioning column. If users need to specify the base path that partition discovery should start with, they can set basePath in the data source options. For example, when path/to/table/gender=male is the path of the data and users set basePath to path/to/table/, gender will be a partitioning column.

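
The effect of basePath can be checked with a small experiment. The sketch below writes a tiny partitioned Parquet table to a temporary directory and reads one partition directory with and without the option. The object name BasePathSketch, the helper name, and the sample rows are assumptions for illustration; the temporary directory stands in for path/to/table.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Hypothetical names; the rows are invented sample data.
object BasePathSketch {
  // Returns (columns without basePath, columns with basePath)
  // when reading only the gender=male partition directory.
  def columnsWithAndWithoutBasePath(spark: SparkSession): (Seq[String], Seq[String]) = {
    import spark.implicits._
    val table = Files.createTempDirectory("table").toString
    Seq(("Ann", 30L, "female", "US"), ("Bob", 25L, "male", "CN"))
      .toDF("name", "age", "gender", "country")
      .write.mode("overwrite").partitionBy("gender", "country").parquet(table)

    val partitionDir = s"$table/gender=male"
    val without = spark.read.parquet(partitionDir).columns.toSeq
    val withBase = spark.read.option("basePath", table).parquet(partitionDir).columns.toSeq
    (without, withBase)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("BasePathSketch")
      .getOrCreate()
    val (without, withBase) = columnsWithAndWithoutBasePath(spark)
    println(s"without basePath: ${without.mkString(", ")}")
    println(s"with basePath:    ${withBase.mkString(", ")}")
    spark.stop()
  }
}
```

Without the option, only partition directories below the given path are discovered, so gender is missing; with basePath set to the table root, gender is recovered as a partitioning column.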

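IV. Spark 2.x

1. Catalog

Background:

In Spark 1.x, looking up metastore information meant querying the MySQL database behind the metastore directly. Therefore:

Spark 2.x introduced the Catalog API.

(1) Entry point in the source code: SparkSession.scala

(2) CatalogImpl(self)

It has a method that lists the databases; call it directly to test: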

val catalog = spark.catalog

catalog.listDatabases().show(false)

catalog.listDatabases().select("name").show(false)

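Summary:

Use Catalog whenever you need metadata. To pull out specific fields, for example from the database listing, just call select("xxx") on the result.

(3) List tables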

catalog.listTables("hive").show(false)

(4) List columns

catalog.listColumns("hive","hive_array").show(false)

2. DataFrame vs. Dataset

(1) Read a CSV file into a DataFrame through the external data source API

val csv = spark.read.format("csv").load("file:///opt/data/sales.csv")

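The header is wrong: without the header option, the first line is read as data and the columns get default names (_c0, _c1, ...).

(2) Improvement: enable the header option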

val csv = spark.read.format("csv").option("header","true").load("file:///opt/data/sales.csv")

(3) Select a few columns

csv.select("transactionId","amountPaid").show(false)

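If a column name is misspelled, the query only fails at runtime with a column-not-found error. This is where Dataset helps: its typed operations are checked at compile time.

3. Dataset

Review of lesson 43

A Dataset is strongly typed. For an introduction to strong typing, see: https://baike.baidu.com/item/%E5%BC%BA%E7%B1%BB%E5%9E%8B/5074514?fr=aladdin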
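
To make the strong-typing point concrete, here is a minimal sketch that builds a Dataset from in-memory rows instead of the CSV file. The object name TypedDatasetSketch, the helper amounts, and the sample values are assumptions for illustration. Field access goes through the case class, so a misspelled field is rejected by the compiler.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical names; the rows are invented sample data.
object TypedDatasetSketch {
  case class Sales(transactionId: Int, customerId: Int, itemId: Int, amountPaid: Double)

  // Typed access: a misspelled field such as _.amountPayed would not compile.
  def amounts(spark: SparkSession, ds: Dataset[Sales]): Array[Double] = {
    import spark.implicits._
    ds.map(_.amountPaid).collect()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("TypedDatasetSketch")
      .getOrCreate()
    import spark.implicits._
    val ds = Seq(Sales(111, 1, 1, 100.0), Sales(112, 2, 2, 505.0)).toDS()
    println(amounts(spark, ds).mkString(", "))
    // By contrast, ds.select("amountPayed") compiles and only fails at runtime,
    // because select with string column names is an untyped operation.
    spark.stop()
  }
}
```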

(1) DF.as[]

case class Sales(transactionId:Int,customerId:Int,itemId:Int,amountPaid:Double)
val csvds = spark.read.format("csv").option("header","true").load("file:///opt/data/sales.csv").as[Sales]
csvds.select("transactionId","amountPaid")

This raises an error: