SparkSQL-03

Latest recommended article published on 2023-10-31 15:42:28

大米饭精灵

Category: Spark    Tags: SparkSQL

Article link: https://blog.csdn.net/qq_15300683/article/details/80531515

Source: http://spark.apache.org/docs/latest/sql-programming-guide.html

The three goals of SparkSQL:

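1. Less Code

a) Schema inference: Spark SQL can work out the schema on its own (for example, when reading JSON or Parquet directly, because the structure is carried in the data files).

Partition Discovery: the partition columns and their types are also inferred automatically, from the partition directory paths.

Schema Merging (like Protocol Buffers, Avro, and Thrift, Parquet supports schema evolution, so columns can be added and removed flexibly)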

// This is used to implicitly convert an RDD to a DataFrame.
import spark.implicits._

// Create a simple DataFrame, store into a partition directory
val squaresDF = spark.sparkContext.makeRDD(1 to 5).map(i => (i, i * i)).toDF("value", "square")
squaresDF.write.parquet("data/test_table/key=1")

// Create another DataFrame in a new partition directory,
// adding a new column and dropping an existing column
val cubesDF = spark.sparkContext.makeRDD(6 to 10).map(i => (i, i * i * i)).toDF("value", "cube")
cubesDF.write.parquet("data/test_table/key=2")

// Read the partitioned table
val mergedDF = spark.read.option("mergeSchema", "true").parquet("data/test_table")
mergedDF.printSchema()

// The final schema consists of all 3 columns in the Parquet files together
// with the partitioning column appeared in the partition directory paths
// root
//  |-- value: int (nullable = true)
//  |-- square: int (nullable = true)
//  |-- cube: int (nullable = true)
//  |-- key: int (nullable = true)

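In effect the schemas are added together (like +=): the merged schema is the union of the columns found in all the files, plus the partition column.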
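Going back to point a), schema inference can be illustrated with a small JSON sketch. The object name `SchemaInferenceSketch`, the sample records, and the local `SparkSession` settings are assumptions made for illustration; they are not from the original post.

```scala
import org.apache.spark.sql.SparkSession

object SchemaInferenceSketch {
  // Returns (column name, inferred type name) pairs for a tiny JSON dataset.
  def run(): Seq[(String, String)] = {
    val spark = SparkSession.builder()
      .master("local[*]")                 // assumption: local mode, for a quick try-out
      .appName("schema-inference-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample records; no schema is declared anywhere.
    val jsonLines = Seq(
      """{"name": "alice", "age": 30}""",
      """{"name": "bob", "age": 25}"""
    ).toDS()

    // Spark scans the records and infers column names and types itself.
    val people = spark.read.json(jsonLines)
    people.printSchema()

    people.schema.fields.map(f => (f.name, f.dataType.typeName)).toSeq
  }
}
```

Calling `SchemaInferenceSketch.run()` should report `age` as `long` and `name` as `string`; JSON inference lists the columns alphabetically and reads whole numbers as `long`.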

DF.write.parquet("path") is shorthand. In day-to-day development we usually write:

DF.write.format("parquet").mode("overwrite").save("path")

or

import org.apache.spark.sql.SaveMode

DF.write.format("parquet").mode(SaveMode.Overwrite).save("path")

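A brief note on save modes: besides overwrite, Spark also supports append, ignore, and error/errorifexists (the default).

b) Catalyst optimizes queries automatically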
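The behaviour of the four save modes can be sketched as follows. This is a minimal illustration, not part of the original post: the object name `SaveModeSketch`, the temporary output directory, and the local `SparkSession` settings are all assumptions.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{SaveMode, SparkSession}

object SaveModeSketch {
  // Writes the same 2-row DataFrame four times and returns the final row count.
  def run(): Long = {
    val spark = SparkSession.builder()
      .master("local[*]")                 // assumption: local mode, for a quick try-out
      .appName("save-mode-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, 1), (2, 4)).toDF("value", "square")
    // Hypothetical output location inside a fresh temp directory (does not exist yet).
    val path = Files.createTempDirectory("save_mode_demo").resolve("out").toString

    // ErrorIfExists is the default: succeeds here only because the path is new.
    df.write.format("parquet").mode(SaveMode.ErrorIfExists).save(path)
    // Append adds the rows to the existing data (now 4 rows).
    df.write.format("parquet").mode(SaveMode.Append).save(path)
    // Ignore silently does nothing because the path already exists (still 4 rows).
    df.write.format("parquet").mode(SaveMode.Ignore).save(path)
    // Overwrite replaces the existing data (back to 2 rows).
    df.write.format("parquet").mode(SaveMode.Overwrite).save(path)

    spark.read.parquet(path).count()
  }
}
```

The string forms `"error"`/`"errorifexists"`, `"append"`, `"ignore"`, and `"overwrite"` passed to `mode(...)` select the same behaviour as the `SaveMode` constants.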

2.Less Data

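What is the fastest way to analyze big data?

Answer: read only what we need and skip irrelevant data. The techniques involved include partitioning (for example by time), compressed file formats (ORC, Parquet), columnar storage, filtering (where clauses, join on conditions), and predicate pushdown and projection (column pruning), which columnar formats such as Parquet and ORC support.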
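A rough sketch of these ideas working together is shown below. The object name `LessDataSketch`, the sample rows, the column names, and the temporary path are assumptions made for illustration. The data is written partitioned by day, and the query then touches a single partition and a single output column.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object LessDataSketch {
  def run(): Seq[String] = {
    val spark = SparkSession.builder()
      .master("local[*]")                 // assumption: local mode, for a quick try-out
      .appName("less-data-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical output location inside a fresh temp directory.
    val path = Files.createTempDirectory("less_data_demo").resolve("events").toString

    // Partition by time: each day gets its own directory (day=2018-05-30, ...).
    Seq(
      ("2018-05-30", "alice", 10),
      ("2018-05-31", "bob", 20),
      ("2018-05-31", "carol", 30)
    ).toDF("day", "user", "amount")
      .write.partitionBy("day").parquet(path)

    val query = spark.read.parquet(path)
      .where($"day" === "2018-05-31" && $"amount" > 15) // partition filter + data filter
      .select("user")                                    // only one column is needed

    // The physical plan of the Parquet scan typically lists PartitionFilters,
    // PushedFilters, and a ReadSchema narrowed to the columns actually used.
    query.explain()

    query.as[String].collect().sorted.toSeq
  }
}
```

The exact text printed by `explain()` differs between Spark versions, so it is better read as a diagnostic than compared verbatim.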

3.Let the optimizer do the hard work

For example: find the people whose salary is greater than 30000. Only name is needed in the result, not age or salary.

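case class Person(name: String, age: Int, salary: Double)
sc.textFile("")
  .map(x => x.split("\t"))         // split each tab-separated line into fields
  .map(x => Person(......))        // build a Person from the fields
  .map(p => (p.name, p.salary))    // keep only name and salary
  .filter(_._2 > 30000)            // salary greater than 30000
  .map(_._1)                       // keep only the name
  .collect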

How can this be optimized?
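One possible answer, in the spirit of letting the optimizer do the hard work, is to express the same logic with the Dataset/DataFrame API so that Catalyst can plan the filter and the column selection itself. The sketch below is an assumption about how that might look, not the original author's solution. The object name `OptimizerSketch` and the in-memory sample rows (standing in for the tab-separated input file) are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int, salary: Double)

object OptimizerSketch {
  def run(): Seq[String] = {
    val spark = SparkSession.builder()
      .master("local[*]")                 // assumption: local mode, for a quick try-out
      .appName("optimizer-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows standing in for the tab-separated input file.
    val people = Seq(
      Person("alice", 30, 45000.0),
      Person("bob", 25, 28000.0),
      Person("carol", 41, 30000.0)
    ).toDS()

    // Declarative version: state what is wanted and leave the ordering of the
    // filter and the projection to the optimizer.
    val highEarners = people
      .where($"salary" > 30000)
      .select($"name")

    // Prints the parsed, analyzed, optimized, and physical plans for inspection.
    highEarners.explain(true)

    highEarners.as[String].collect().toSeq
  }
}
```

When staying with plain RDDs, the equivalent manual tuning is to filter as early as possible and to avoid carrying fields such as age through the pipeline. With a columnar source such as Parquet, the DataFrame version can additionally benefit from pushed-down filters and pruned columns.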