A Quick Usage Summary of Spark Core and Spark SQL

yunpeng.zhou

Last modified: 2024-07-12 16:07:35


Categories: Big Data, Data Analysis. Tags: spark, python, distributed, big data

First published: 2022-08-16 16:13:33

Article link: https://blog.csdn.net/a1314_521a/article/details/126368419


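I. Spark Core: Summary of Common RDD Operators

1. RDD Properties

RDD definition: Resilient Distributed Dataset

The most basic data abstraction in Spark: an immutable, partitionable collection whose elements can be computed in parallel.
The five key properties of an RDD:
1. An RDD is made up of partitions; the partition is the smallest unit of RDD data storage.
2. An RDD method is applied to every one of its partitions.
3. RDDs depend on one another (an RDD has a lineage).
4. A key-value RDD can have its own partitioner.
5. Partition planning tries to place each partition close to the server that holds its data.
Entry point for RDD programming: SparkContext
1. In every supported language, the entry point for Spark RDD programming is a SparkContext object. Only once a SparkContext has been built can later API calls and computations run on top of it. From a programming point of view, its main job is to create the first RDD.
Building an RDD
1. Parallelize a collection (local list object --> distributed RDD).
2. Read a local data source or an HDFS file.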

2. Common RDD Operators

2.1 Creating the SparkContext Object

# Create the SparkContext object
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName('Spark core').setMaster('local[10]')
sc = SparkContext(conf=conf)
sc

# Object details
'''
SparkContext

Spark UI

Version    v3.3.0
Master     local[10]
AppName    Spark core
'''

2.2 Creating RDD Objects

## Creating RDD objects
# 1. From a local list object
rdd1 = sc.parallelize(c=[1,2,3,4,5],numSlices=3)
print(rdd1.glom().collect()) # gather every partition to the driver, keeping the per-partition nesting
print(rdd1.getNumPartitions()) # number of partitions in the RDD

# Output
'''
[[1], [2, 3], [4, 5]]
3
'''

# 2. By reading files
# 2.1 sc.textFile API
rdd1 = sc.textFile('./data',minPartitions=None) # 2nd argument: minimum number of partitions; usually left unset so that Spark picks a sensible split
print(rdd1.collect()) # reads every file under the path; each line is treated as one record
print(rdd1.getNumPartitions())

# Output
'''
['hellow world', 'hollow python', 'hollow java']
3
'''

# 2.2 sc.wholeTextFiles API
rdd1 = sc.wholeTextFiles('./data',minPartitions=None)
print(rdd1.collect()) # reads every file under the path; each element is a 2-tuple (k: file path, v: the full content of that file)
print(rdd1.getNumPartitions()) # typically used when there are many small files (small files are preferred, as each file will be loaded fully in memory)

# Output
'''
[('file:/data/jupyter_lab/zyp/pyspark学习/data/wordcount1.txt', 'hellow world\nhollow python\nhollow java')]
1
'''
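
The nesting `[[1], [2, 3], [4, 5]]` printed for `sc.parallelize([1,2,3,4,5], 3)` can be imitated in plain Python, with no Spark needed. This is only a sketch under an assumption: the helper `slice_into_partitions` is hypothetical (it is not a PySpark function), and the boundary rule `i * n // k` was chosen because it reproduces the outputs shown in this post. PySpark's real implementation also batches and serializes the elements, so treat this as a mental model rather than the actual code path.

```python
def slice_into_partitions(items, num_slices):
    """Split a list into num_slices contiguous chunks.

    Chunk i covers positions [i * n // k, (i + 1) * n // k), so sizes differ
    by at most one and later chunks receive the extra elements.
    """
    n = len(items)
    return [
        items[i * n // num_slices:(i + 1) * n // num_slices]
        for i in range(num_slices)
    ]


print(slice_into_partitions([1, 2, 3, 4, 5], 3))   # [[1], [2, 3], [4, 5]]
print(slice_into_partitions(list(range(10)), 3))   # [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
```

The second call matches the `glom()` output for `sc.parallelize(range(10), 3)` that appears later in the post.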

2.3 Transformation operators: lazily evaluated; they only record the operation on the RDD without executing it, and return a new RDD

# 1. map: apply a function to every element of the RDD
rdd1 = sc.parallelize(range(10),3)
print(rdd1.collect())
print(rdd1.map(lambda x:x+1).collect())

# Output:
'''
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
'''

# 2. flatMap: apply map to the RDD, then remove one level of nesting
rdd1 = sc.parallelize([[1,2],[3,4],[5,6]])
print(rdd1.collect())
print(rdd1.flatMap(lambda x:x).collect())

# Output:
'''
[[1, 2], [3, 4], [5, 6]]
[1, 2, 3, 4, 5, 6]
'''

# 3. mapValues: apply map to the values of a key-value RDD
rdd1 = sc.parallelize([('a',1),('a',1),('b',1),('b',1),('b',1)])
rdd1.mapValues(lambda x:x+1).collect()

# Output
'''
[('a',2),('a',2),('b',2),('b',2),('b',2)]
'''

# 4. reduceByKey: for key-value RDDs; groups by key automatically, then aggregates the values within each group using the supplied logic and returns the aggregated key-value pairs
rdd1 = sc.parallelize([('a',1),('a',1),('b',1),('b',1),('b',1)])
print(rdd1.reduceByKey(lambda a,b:a+b).collect())

# Output
'''
[('b', 3), ('a', 2)]
'''
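
As a rough picture of what `reduceByKey` does, the sketch below first combines values per key inside each partition and only then merges the per-partition results. It is a plain-Python model, not Spark code: the name `reduce_by_key_model` and the explicit list of partitions are illustrative assumptions, and in real Spark the second step happens through a shuffle across executors.

```python
def reduce_by_key_model(partitions, func):
    """Model reduceByKey: combine within each partition, then merge the results."""
    # Step 1: combine values per key inside each partition (map-side combine).
    combined = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        combined.append(local)

    # Step 2: merge the per-partition dictionaries (stands in for the shuffle).
    merged = {}
    for local in combined:
        for key, value in local.items():
            merged[key] = func(merged[key], value) if key in merged else value
    return merged


partitions = [[('a', 1), ('a', 1), ('b', 1)], [('b', 1), ('b', 1)]]
print(reduce_by_key_model(partitions, lambda a, b: a + b))  # {'a': 2, 'b': 3}
```

Because step 1 shrinks each partition to one value per key before anything is merged, less data has to move than if all raw values were grouped first.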

# 5. groupBy: group the RDD's data by the given rule; returns a key-value RDD (v: an iterable object)
rdd1 = sc.parallelize([('a',1),('a',1),('b',1),('b',1),('b',1)])
print(rdd1.groupBy(lambda x:x[0]).collect())

# Output (each value is an iterable object)
'''
[('b', <pyspark.resultiterable.ResultIterable object at 0x2adb5a6a30d0>), ('a', <pyspark.resultiterable.ResultIterable object at 0x2adb5a6a9190>)]
'''

# 6. groupByKey: for key-value RDDs; groups automatically by key (groupBy has no such restriction)
rdd1 = sc.parallelize([('a',1),('a',1),('b',1),('b',1),('b',1)])
rdd1.groupByKey().collect()

# Output
'''
[('b', <pyspark.resultiterable.ResultIterable at 0x2adb5a6d61f0>),
 ('a', <pyspark.resultiterable.ResultIterable at 0x2adb5a6d6e50>)]
'''

# 7. filter: filter the RDD's data with the given predicate (same usage as Python's built-in filter)
rdd1 = sc.parallelize([('a',1),('a',1),('b',1),('b',1),('b',1)])
rdd1.filter(lambda x: x[0] == 'a').collect()

# Output
'''
[('a', 1), ('a', 1)]
'''

# 8. distinct: remove duplicates from the RDD and return a new RDD (key-value data can be deduplicated too)
rdd1 = sc.parallelize([('a',1),('a',1),('b',1),('b',1),('b',1)])
rdd1.distinct().collect()

# Output
'''
[('b', 1), ('a', 1)]
'''

# 9. union: merge two RDDs into one
rdd1 = sc.parallelize([('b',1),('b',1),('b',1)])
rdd2 = sc.parallelize([('a',1),('a',1)])
rdd2.union(rdd1).collect()

# Output
'''
[('a', 1), ('a', 1), ('b', 1), ('b', 1), ('b', 1)]
'''

# 10. intersection: the intersection of two RDDs
rdd1 = sc.parallelize(range(10))
rdd2 = sc.parallelize(range(5))
rdd1.intersection(rdd2).collect()

# Output
'''
[0, 1, 2, 3, 4]
'''

# 11. join: join two key-value RDDs on their keys (equivalent to a SQL inner join)
rdd1 = sc.parallelize([('name','张三'),('sex','男'),('age',19),('love','足球')])
rdd2 = sc.parallelize([('name','李四'),('sex','女'),('age',12)])
print(rdd1.join(rdd2).collect())

# Output
'''
[('name', ('张三', '李四')), ('sex', ('男', '女')), ('age', (19, 12))]
'''

# 12. leftOuterJoin: left outer join; rightOuterJoin: right outer join
rdd1 = sc.parallelize([('name','张三'),('sex','男'),('age',19),('love','足球')])
rdd2 = sc.parallelize([('name','李四'),('sex','女'),('age',12)])
rdd1.leftOuterJoin(rdd2).collect()

# Output
'''
[('name', ('张三', '李四')),
 ('sex', ('男', '女')),
 ('age', (19, 12)),
 ('love', ('足球', None))]
'''

# 13. glom: add one level of nesting to the RDD's data, grouped by partition
rdd1 = sc.parallelize(range(10),3)
rdd1.glom().collect()

# Output
'''
[[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
'''

# 14. sortBy: sort the RDD's data by the given rule
# Syntax: rdd.sortBy(keyfunc,ascending=True,numPartitions=None)  keyfunc: defines the sort key; ascending: False for descending order; numPartitions: number of partitions after sorting
rdd1 = sc.parallelize(range(10),3)
print(rdd1.glom().collect())
print(rdd1.sortBy(lambda x:x,ascending=False,numPartitions=2).glom().collect())

# Output
'''
[[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
[[9, 8, 7, 6], [5, 4, 3, 2, 1, 0]]
'''

# 15. sortByKey: for key-value RDDs; sorts by key
# Syntax: rdd.sortByKey(ascending=True,numPartitions=None,keyfunc=...)  keyfunc: applied to each key before sorting
rdd1 = sc.parallelize([('a',1),('d',1),('c',2),('b',1),('b',1),('b',1)])
rdd1.sortByKey().collect()

# Output
'''
[('a', 1), ('b', 1), ('b', 1), ('b', 1), ('c', 2), ('d', 1)]
'''

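2.4 Action operators: trigger all the computation an RDD depends on and return a value or a result (for example, a saved file)

# 1. collect: gather the data of every partition into the driver as a list (note: check that the data will not exhaust the driver's memory)
rdd1 = sc.parallelize([1,2,3,4,5],2)
type(rdd1.glom().collect()) # >> list


# 2. count: count the elements in the RDD; returns a number
rdd1 = sc.parallelize([('a',1),('d',1),('c',2),('b',1),('b',1),('b',2)])
print(rdd1.count())   # >> 6

# 3. countByKey: count how many times each key occurs (for key-value RDDs)
rdd1 = sc.parallelize([('a',1),('d',1),('c',2),('b',1),('b',1),('b',2)])
print(rdd1.countByKey())
      
# Output
'''
defaultdict(<class 'int'>, {'a': 1, 'd': 1, 'c': 1, 'b': 3})
'''

# 4. reduce: aggregate the RDD's elements with the given function
rdd1 = sc.parallelize([1,2,3,4,5],2)
rdd1.reduce(lambda a,b :a+b) # >> 15

# 5. fold: like reduce, but with an initial value (the initial value is applied both when aggregating within each partition and when aggregating across partitions)
rdd1 = sc.parallelize([1,2,3,4,5],2)
rdd1.fold(zeroValue=10,op=lambda a,b:a+b) # >> 10+(10+1+2)+(10+3+4+5) = 45

# 6. first: take the first element of the RDD
rdd1 = sc.parallelize([1,2,3,4,5],2)
print(rdd1.first()) # >> 1

# 7. take(N): take the first N elements of the RDD and return them as a list
print(rdd1.take(3)) # >> [1,2,3]

# 8. top(N): sort the RDD in descending order and take the first N elements
print(rdd1.top(3))  # >> [5, 4, 3]

# 9. takeSample: draw a random sample from the RDD
# Syntax: takeSample(withReplacement: True/False (whether an element may be drawn more than once), num: sample size, seed: random seed)
rdd1 = sc.parallelize(range(100),5)
rdd1.takeSample(False,10)

# Output
'''
[85, 17, 40, 80, 12, 63, 70, 96, 43, 33]
'''

# 10. takeOrdered: sort the RDD and take the first n elements (like top, but ascending by default and with a custom sort key)
rdd1 = sc.parallelize((1,3,6,7,3,4,5,8,2,0))
rdd1.takeOrdered(num=3,key=lambda x:x) # >> [0,1,2]

# 11. foreach: run the same operation on every element of the RDD; like map, but with no return value
rdd1 = sc.parallelize(range(5),2)
rdd1.foreach(lambda x: print(x))

# 12. saveAsTextFile: write the RDD's data to text files (supports local paths, HDFS, etc.)
rdd1.saveAsTextFile(path='./data/2222') # the target directory must not exist yet

## Note: foreach and saveAsTextFile do not send their results back to the driver; they run directly on the workers that hold each partition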
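
The arithmetic `10+(10+1+2)+(10+3+4+5) = 45` given for `fold` above can be checked with a small plain-Python model. This is a sketch, not Spark code: `fold_model` is a hypothetical helper, and the partition contents `[[1, 2], [3, 4, 5]]` are assumed to be what `sc.parallelize([1,2,3,4,5], 2)` produces.

```python
from functools import reduce


def fold_model(partitions, zero_value, op):
    """Model RDD.fold: the zero value seeds every partition and the final merge."""
    # Fold each partition, starting from zero_value every time.
    partials = [reduce(op, part, zero_value) for part in partitions]
    # Fold the partial results together, again starting from zero_value.
    return reduce(op, partials, zero_value)


add = lambda a, b: a + b
print(fold_model([[1, 2], [3, 4, 5]], 10, add))    # 45
print(fold_model([[1], [2, 3], [4, 5]], 10, add))  # 55
```

The second call shows why a non-neutral zero value is risky: the same data split into three partitions gives a different result, because the zero value is added once per partition plus once more for the final merge.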

2.5 Partition operators: these operate on whole partitions rather than on individual elements

# 1. mapPartitions: like map, except that the function receives one whole partition at a time
rdd1 = sc.parallelize([1,2,3,4,5,6,7],3)
print(rdd1.collect())
def f(x): yield sorted(x,reverse=True)
print(rdd1.mapPartitions(f).glom().collect())

# Output

  '''
  [1, 2, 3, 4, 5, 6, 7]
  [[[2, 1]], [[4, 3]], [[7, 6, 5]]]
  '''

# 2. foreachPartition: mapPartitions without a return value; results are not sent back to the driver

rdd1 = sc.parallelize([1,2,3,4,5,6,7],3)
print(rdd1.collect())
rdd1.foreachPartition(lambda x:print(x)) # x is an iterator over one partition; the print runs on the executor side

# Output

  '''
  [1, 2, 3, 4, 5, 6, 7]
  '''

# 3. partitionBy: custom partitioning of an RDD (key-value data)

rdd1 = sc.parallelize([1,2,3,4,5,6,7],3).map(lambda x:(x,x))
print(rdd1.glom().collect())
rdd1.partitionBy(2,lambda x: 0 if x>3 else 1).glom().collect() # argument 1: number of partitions after repartitioning; argument 2: function that maps each key to a partition index

# Output

  '''
  [[(1, 1), (2, 2)], [(3, 3), (4, 4)], [(5, 5), (6, 6), (7, 7)]]
  [[(4, 4), (5, 5), (6, 6), (7, 7)], [(1, 1), (2, 2), (3, 3)]]
  '''
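
The `partitionBy` result above can be reproduced with a short plain-Python model: every pair goes to the bucket whose index the partition function returns for its key. The helper `partition_by_model` is hypothetical and only illustrates the routing rule; the modulo mirrors how the returned index is wrapped into the valid range, and no shuffle or executors are involved here.

```python
def partition_by_model(pairs, num_partitions, partition_func):
    """Route each (key, value) pair to bucket partition_func(key) % num_partitions."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        buckets[partition_func(key) % num_partitions].append((key, value))
    return buckets


pairs = [(x, x) for x in range(1, 8)]
print(partition_by_model(pairs, 2, lambda k: 0 if k > 3 else 1))
# [[(4, 4), (5, 5), (6, 6), (7, 7)], [(1, 1), (2, 2), (3, 3)]]
```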
    
# 4. repartition: change only the number of partitions (this always shuffles; to limit shuffle cost, avoid increasing the partition count, and usually leave it unchanged)

rdd1 = sc.parallelize([1,2,3,4,5,6,7],3)
print(rdd1.glom().collect())
print(rdd1.repartition(2).glom().collect())

# Output

  '''
  [[1, 2], [3, 4], [5, 6, 7]]
  [[1, 2, 5, 6, 7], [3, 4]]
  '''

# 5. coalesce: increase or decrease the number of partitions

# rdd1.coalesce(numPartitions: number of partitions after the change, shuffle: True/False; shuffle=True is required to increase the partition count)

rdd1 = sc.parallelize([1,2,3,4,5,6,7],3)
print(rdd1.glom().collect())
print(rdd1.coalesce(2).glom().collect()) # repartition(n) == coalesce(n,shuffle=True)

# Output

  '''
  [[1, 2], [3, 4], [5, 6, 7]]
  [[1, 2], [3, 4, 5, 6, 7]]
  '''

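3. RDD Caching and Optimization

3.1 cache and checkpoint

RDD data is intermediate data

RDDs are derived from one another through transformations. Once execution starts (triggered by an action), each newly produced RDD means that the data of the older RDD is discarded.
When a discarded RDD is referenced again, Spark can only rebuild it from its lineage, re-executing from rdd1 until the required RDD has been produced.

Optimizing intermediate data: cache the RDDs that are reused

cache()/persist(); unpersist() # each partition stores its own data in the memory or on the disk of the executor it lives on, so storage is distributed; fast (StorageLevel is imported with: from pyspark import StorageLevel)
- rdd.cache() # cache in memory
- rdd.persist(StorageLevel.MEMORY_ONLY) # cache in memory only
- rdd.persist(StorageLevel.MEMORY_ONLY_2) # cache in memory only, 2 replicas
- rdd.persist(StorageLevel.DISK_ONLY) # cache on disk only
- rdd.persist(StorageLevel.DISK_ONLY_2) # cache on disk only, 2 replicas
- rdd.persist(StorageLevel.DISK_ONLY_3) # cache on disk only, 3 replicas
- rdd.persist(StorageLevel.MEMORY_AND_DISK) # memory first, spilling to disk when memory runs out
- rdd.persist(StorageLevel.MEMORY_AND_DISK_2) # memory first, spilling to disk when memory runs out, 2 replicas
- rdd.unpersist() # clear the cache
checkpoint # saves the RDD's data; disk storage only; the data of all partitions is collected and stored centrally on HDFS or the local file system; safer
- sc.setCheckpointDir('hdfs://host:8020/spark_backpu') # set the checkpoint directory; must be set before checkpointing
- rdd.checkpoint()

Notes:

Checkpointing is a heavyweight mechanism: use it when recomputing the RDD is expensive or the data is large; when the data is small, use caching.
Neither cache nor checkpoint is an action; the data is saved only after an action has run on the RDD and it actually holds data.

# cache()/persist(): caching operators
## 1. Without caching
rdd1 = sc.parallelize([1])

# Declare an accumulator that records how many times the intermediate RDD is executed
value = sc.accumulator(0)

def f(x):
    global value # when a task reaches this point, it gets a copy of value from the driver
    value += 1
    return x

rdd2 = rdd1.map(f)
rdd2.count()
rdd3 = rdd2.map(lambda x:x+1)
rdd3.collect()
print(f'Without caching, executions of rdd2 while computing rdd3, value: {value}')

# Output
'''
Without caching, executions of rdd2 while computing rdd3, value: 2
'''

## 2. With caching
rdd1 = sc.parallelize([1])

# Declare an accumulator that records how many times the intermediate RDD is executed
value = sc.accumulator(0)

def f(x):
    global value # when a task reaches this point, it gets a copy of value from the driver
    value += 1
    return x

rdd2 = rdd1.map(f)
rdd2.cache()
rdd2.count() # cache is not an action; the count here triggers the computation so that the cache is filled


rdd3 = rdd2.map(lambda x:x+1)
rdd3.collect()
print(f'With rdd2 cached, executions of rdd2 while computing rdd3, value: {value}')
# Output
'''
With rdd2 cached, executions of rdd2 while computing rdd3, value: 1
'''

rdd2.unpersist() # clear the cache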
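
The counts 2 and 1 in the experiment above follow directly from lineage recomputation, which can be imitated with a tiny lazy dataset in plain Python. Everything here is an illustrative assumption: `LazyDataset` and `count_calls` are hypothetical names, a single Python list stands in for the distributed partitions, and a plain dictionary stands in for the accumulator.

```python
class LazyDataset:
    """A minimal lazy dataset: it stores how to compute itself, not the data."""

    def __init__(self, compute):
        self._compute = compute      # the "lineage": a function that rebuilds the data
        self._use_cache = False
        self._cached = None

    def map(self, func):
        # The child's lineage re-runs the parent every time it is evaluated.
        return LazyDataset(lambda: [func(x) for x in self.collect()])

    def cache(self):
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache and self._cached is not None:
            return self._cached      # reuse the stored result
        data = self._compute()       # otherwise recompute from the lineage
        if self._use_cache:
            self._cached = data
        return data


def count_calls(use_cache):
    calls = {'n': 0}

    def f(x):
        calls['n'] += 1
        return x

    ds2 = LazyDataset(lambda: [1]).map(f)
    if use_cache:
        ds2.cache()
    ds2.collect()                              # first action
    ds2.map(lambda x: x + 1).collect()         # second action, depends on ds2
    return calls['n']


print(count_calls(use_cache=False))  # 2
print(count_calls(use_cache=True))   # 1
```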

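4. RDD Shared Variables: Broadcast Variables and Accumulators

4.1 Shared variables (broadcast variables and accumulators)

Broadcast variables (for cases where a local list object interacts with an RDD)
An executor is a process, and resources are shared within a process. When a local list object is wrapped in a broadcast variable, Spark sends only one copy of the local list to each executor, and the task threads inside that executor share it, instead of every task
requesting its own copy of the list from the driver as before. This saves memory.
1. Mark the broadcast variable
  broadcast = sc.broadcast(local_list)
2. Use the broadcast variable
  value = broadcast.value (when a task thread uses local_list, Spark checks whether the executor already holds the broadcast value and does not send it again if it does)
Accumulators (declare a global variable for distributed RDD computation)
acmlt = sc.accumulator(init_value) # define an accumulator; the updates made by every executor are merged into it

## 1. Without an accumulator
rdd1 = sc.parallelize(range(5),5)
init_value = 0

# Increment init_value
def f(x):
    global init_value
    init_value += 1
    return x

rdd2 = rdd1.map(f)

print(rdd2.collect())
print(f'Without an accumulator, init_value: {init_value}')  # the updates made inside each executor are not passed back to init_value in the driver

# Output
'''
[0, 1, 2, 3, 4]
Without an accumulator, init_value: 0
'''

## 2. With an accumulator
rdd1 = sc.parallelize(range(5),5)
init_value = 0
# Declare the accumulator
init_value = sc.accumulator(init_value)

# Increment init_value
def f(x):
    global init_value
    init_value += 1
    return x

rdd2 = rdd1.map(f)

print(rdd2.collect())
print(f'With an accumulator, init_value: {init_value}')  # the updates made inside each executor are merged back into the driver's value

# Output
'''
[0, 1, 2, 3, 4]
With an accumulator, init_value: 5
'''

## 1. Without a broadcast variable
local_list = dict([(1,'小明'),(2,'小红')])
rdd1 = sc.parallelize([(1,98),(2,99)],2)

# Replace ids with names
def f(x):
    name = ''
    if x[0] in local_list:
        name = local_list.get(x[0])
    return name,x[1]

rdd2 = rdd1.map(f)
print(rdd2.collect()) # the program also runs without a broadcast variable, but every task has to request its own copy of local_list

# Output
'''
[('小明', 98), ('小红', 99)]
'''

## 2. With a broadcast variable

local_list = dict([(1,'小明'),(2,'小红')])
# Declare the broadcast variable
local_broadcast = sc.broadcast(local_list)
rdd1 = sc.parallelize([(1,98),(2,99)],2)

# Replace ids with names
def f(x):
    name = ''
    # Use the broadcast variable
    if x[0] in local_broadcast.value:
        name = local_broadcast.value.get(x[0])
    return name,x[1]

rdd2 = rdd1.map(f)
print(rdd2.collect()) # with a broadcast variable, the task threads in each executor share a single copy of local_list

# Output
'''
[('小明', 98), ('小红', 99)]
'''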
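
Why the plain global stays at 0 while the accumulator reaches 5 can be pictured with the following plain-Python model. It is a simplification under stated assumptions: `copy.deepcopy` stands in for shipping a serialized copy of the variable to each task, the helper names are hypothetical, and the loop over partitions stands in for tasks running on executors.

```python
import copy


def run_with_plain_variable(partitions, counter):
    """Each task increments its own copy, so the driver's counter never changes."""
    for part in partitions:
        task_copy = copy.deepcopy(counter)   # stands in for sending the variable to a task
        for _ in part:
            task_copy['n'] += 1              # only the task-local copy is updated
    return counter['n']


def run_with_accumulator(partitions, initial):
    """Each task reports its local total and the driver adds the totals together."""
    total = initial
    for part in partitions:
        local = sum(1 for _ in part)         # work done inside one task
        total += local                       # merge step performed on the driver
    return total


partitions = [[0], [1], [2], [3], [4]]
print(run_with_plain_variable(partitions, {'n': 0}))  # 0
print(run_with_accumulator(partitions, 0))            # 5
```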

5. Setting the Global Parallelism for RDDs

In the configuration file conf/spark-defaults.conf: spark.default.parallelism 100
When submitting a job: spark-submit --conf "spark.default.parallelism=100"
In code: SparkConf().set('spark.default.parallelism','100')

II. Spark SQL Summary

1. Building Spark SQL DataFrames

# 1. Imports
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StringType,IntegerType,FloatType,ArrayType
import  pyspark.sql.functions as F # DataFrame function package (functions in F take Column objects and return a Column object)
import pandas as pd
import numpy as np

# 2. Set the Java environment (needed when using the pyspark Python library)
import os
os.environ['JAVA_HOME'] = '/data/app/jdk1.8.0_333/'

# 3. Build the SparkSession object
spark = SparkSession.builder.appName('test').getOrCreate()

## Building DataFrames
# 1. Build from an RDD
    # 1.1 Create with spark.createDataFrame(rdd,schema=)
    rdd = spark.sparkContext.textFile('./data/students_score.txt')
    rdd = rdd.map(lambda x:x.split(',')).map(lambda x:[int(x[0]),x[1],int(x[2])]) 
    print(rdd.collect())
    '''[[11, '张三', 87], [22, '李四', 67], [33, '王五', 79]]'''

    # Option 1: the schema gives only the column names; types are inferred, and nullable defaults to True
    df = spark.createDataFrame(data=rdd,schema=['id','name','score'])
    df.show() # shows the first 20 rows by default
    df.printSchema() # print the table structure
    ''' 
    +---+----+-----+
    | id|name|score|
    +---+----+-----+
    | 11|张三|   87|
    | 22|李四|   67|
    | 33|王五|   79|
    +---+----+-----+

    root
     |-- id: long (nullable = true)
     |-- name: string (nullable = true)
     |-- score: long (nullable = true)
    '''

    # Option 2: pass the schema as a StructType object
    schema =  StructType()\
              .add(field='id',data_type=IntegerType(),nullable=True)\
              .add(field='name',data_type=StringType(),nullable=True)\
              .add(field='score',data_type=IntegerType(),nullable=False)

    df = spark.createDataFrame(data=rdd,schema=schema)
    df.show() # shows the first 20 rows by default
    df.printSchema() # print the table structure

    '''
    +---+----+-----+
    | id|name|score|
    +---+----+-----+
    | 11|张三|   87|
    | 22|李四|   67|
    | 33|王五|   79|
    +---+----+-----+

    root
     |-- id: integer (nullable = true)
     |-- name: string (nullable = true)
     |-- score: integer (nullable = false)


    '''
    # 1.2 Create with rdd.toDF()
    rdd = spark.sparkContext.textFile('./data/students_score.txt')
    rdd = rdd.map(lambda x:x.split(',')).map(lambda x:[int(x[0]),x[1],int(x[2])])
    print(rdd.collect())

    df =  rdd.toDF(schema=['id','name','score']) # schema can likewise be a list of column names or a StructType object

    df.show() # shows the first 20 rows by default
    df.printSchema() # print the table structure
    
    '''
    [[11, '张三', 87], [22, '李四', 67], [33, '王五', 79]]
    +---+----+-----+
    | id|name|score|
    +---+----+-----+
    | 11|张三|   87|
    | 22|李四|   67|
    | 33|王五|   79|
    +---+----+-----+

    root
     |-- id: long (nullable = true)
     |-- name: string (nullable = true)
     |-- score: long (nullable = true)
    '''

# 2. Build from a pandas DataFrame: turns the pandas DataFrame object into a distributed dataset
    pd_data = pd.DataFrame({'id':[1,2,3],'name':['张三','李四','王五'],
                            'score':[65,35,89]})
    df = spark.createDataFrame(pd_data)

    df.printSchema()
    df.show()
    
    
    '''
    root
     |-- id: long (nullable = true)
     |-- name: string (nullable = true)
     |-- score: long (nullable = true)

    +---+----+-----+
    | id|name|score|
    +---+----+-----+
    |  1|张三|   65|
    |  2|李四|   35|
    |  3|王五|   89|
    +---+----+-----+
    '''
    
    
# 3. Build by reading data files
    # Option 1: read through the unified API
    # Usage:
    '''  sparksession.read.format('text|csv|json|parquet|orc|jdbc|...')\
              .option('k','v')\ # read options, such as sep for csv or the database connection parameters for jdbc
              .schema(string|StructType object)\ # string form: "id INT,name STRING,score INT"
              .load(localpath|hdfs)|.csv()'''
    # Option 2: read with the format-specific method, e.g. sparksession.read.csv(path)
    
    # 3.1 Read a text file: each line becomes one row in a single column, named value by default; use schema to rename the column
    df = spark.read.format('text')\
          .schema("data_value STRING")\
          .load('./test.txt')
    print(f'Reading a txt file, option 1: ')
    df.show()

    df = spark.read.schema("data_value STRING").text('./test.txt',wholetext=False)
    print(f'Reading a txt file, option 2: ')
    df.show()
    
    '''
    Reading a txt file, option 1: 
    +-------------+
    |   data_value|
    +-------------+
    | hellow world|
    |hellow python|
    |  hellow java|
    +-------------+

    Reading a txt file, option 2: 
    +-------------+
    |   data_value|
    +-------------+
    | hellow world|
    |hellow python|
    |  hellow java|
    +-------------+
    '''
    # 3.2 Read a JSON file: the file carries its own field information, so a schema is optional
    df = spark.read.format('json')\
                   .load('./data/test_data/test_data/sql/people.json')

    print(f'Reading a json file, option 1: ')
    df.show()

    df = spark.read.json('./data/test_data/test_data/sql/people.json')
    print(f'Reading a json file, option 2: ')
    df.show()
    '''
    Reading a json file, option 1: 
    +----+-------+
    | age|   name|
    +----+-------+
    |null|Michael|
    |  30|   Andy|
    |  19| Justin|
    +----+-------+

    Reading a json file, option 2: 
    +----+-------+
    | age|   name|
    +----+-------+
    |null|Michael|
    |  30|   Andy|
    |  19| Justin|
    +----+-------+
    '''
    
    # 3.3 Read a CSV file: tabular data; the separator, header and similar options must be specified
    option_dict = {'sep':';','header':True,'encoding':'utf-8'}
    df = spark.read.format('csv')\
                   .options(**option_dict)\
                   .load('./data/people.csv')

    print(f'Reading a csv file, option 1: ')
    df.show()

    df = spark.read.csv('./data/people.csv',sep=';',header=True,encoding='utf-8')   
    print(f'Reading a csv file, option 2: ')
    df.show()
    '''
    Reading a csv file, option 1: 
    +-----+----+---------+
    | name| age|      job|
    +-----+----+---------+
    |Jorge|  30|Developer|
    |  Bob|  32|Developer|
    |  Ani|  11|Developer|
    +-----+----+---------+
    Reading a csv file, option 2: 
    +-----+----+---------+
    | name| age|      job|
    +-----+----+---------+
    |Jorge|  30|Developer|
    |  Bob|  32|Developer|
    |  Ani|  11|Developer|
    +-----+----+---------+
    '''
    # 3.4 Read a SQL table
    df.createTempView('tt') # create a temporary view
    df = spark.read.table(tableName='tt') 
    spark.catalog.dropTempView('tt') # drop the temporary view

    print(f'Reading a SQL table: ')
    df.show()
    '''
    Reading a SQL table: 
    +-----+----+---------+
    | name| age|      job|
    +-----+----+---------+
    |Jorge|  30|Developer|
    |  Bob|  32|Developer|
    |  Ani|  11|Developer|
    +-----+----+---------+
    '''
    
    # 3.5 Read parquet data: columnar storage with an embedded schema; serialized files are compact

    df = spark.read.format('parquet')\
                   .load('./data/users.parquet')

    print(f'Reading a parquet file, option 1: ')
    df.show()

    df = spark.read.parquet('./data/users.parquet')
    print(f'Reading a parquet file, option 2: ')
    df.show()
    
    '''
    Reading a parquet file, option 1: 
    +------+--------------+----------------+
    |  name|favorite_color|favorite_numbers|
    +------+--------------+----------------+
    |Alyssa|          null|  [3, 9, 15, 20]|
    |   Ben|           red|              []|
    +------+--------------+----------------+

    Reading a parquet file, option 2: 
    +------+--------------+----------------+
    |  name|favorite_color|favorite_numbers|
    +------+--------------+----------------+
    |Alyssa|          null|  [3, 9, 15, 20]|
    |   Ben|           red|              []|
    +------+--------------+----------------+
    '''

2. Spark SQL DataFrame Processing Styles

## DataFrame processing styles
pd_data = pd.DataFrame({'id':[1,2,3],'name':['张三','李四','王五'],'score':[65,35,89]})
df = spark.createDataFrame(pd_data)

# 1. DSL (domain-specific language): the DataFrame's own API
    # 1.1 df.show(): print the DataFrame. n: number of rows to show, 20 by default;
                     # truncate: whether to truncate long field values; by default values are cut to 20 characters
    df.show(n=20,truncate=True)
    '''
    +---+----+-----+
    | id|name|score|
    +---+----+-----+
    |  1|张三|   65|
    |  2|李四|   35|
    |  3|王五|   89|
    +---+----+-----+
    '''
    
    # 1.2 df.printSchema(): print the table structure of df
    df.printSchema()
    '''
    root
     |-- id: long (nullable = true)
     |-- name: string (nullable = true)
     |-- score: long (nullable = true)
    '''
    
    # 1.3 df.select(): select the given columns of df. Arguments can be a Column object, a str, a list[str] or a list[Column]
    df.select('name').show()
    df.select(df['name']).show() # df['name'] returns a Column object
    '''
    +----+
    |name|
    +----+
    |张三|
    |李四|
    |王五|
    +----+

    +----+
    |name|
    +----+
    |张三|
    |李四|
    |王五|
    +----+
    '''
    
    # 1.4 df.filter() | df.where(): filter the rows of df by a condition and return a new df; similar to pandas query()
    df.filter('score > 60').show()
    df.filter(df['score']>60).show()
    df.where('score > 60').show()
    df.where(df['score']>60).show()
    '''
    +---+----+-----+
    | id|name|score|
    +---+----+-----+
    |  1|张三|   65|
    |  3|王五|   89|
    +---+----+-----+
    '''
    
    # 1.5 df.groupBy(): group the rows; returns a GroupedData object
    pd_data = pd.DataFrame({'id':[1,2,3,4,5,6],
                            'name':['张三','李四','王五','张三','李四','王五'],
                            'score':[65,35,89,34,67,97]})
    df = spark.createDataFrame(pd_data)
    df.groupBy('name').sum().show()
    '''
    +----+-------+----------+
    |name|sum(id)|sum(score)|
    +----+-------+----------+
    |张三|      5|        99|
    |李四|      7|       102|
    |王五|      9|       186|
    +----+-------+----------+
    '''
    # 1.6 df.first(): take the first row of df; returns a Row object
    df = spark.read.schema('word STRING').text('./data/test_data/test_data/words.txt')
    df.show()
    print(df.first()['word'])  # a Row object has no show() method
    '''
    +------------+
    |        word|
    +------------+
    | hello spark|
    |hello hadoop|
    | hello flink|
    +------------+
    
    hello spark
    '''

    # 1.7 df.limit(): return the given number of rows from df, like LIMIT in SQL
    df.limit(2).show() 
    '''
    +------------+
    |        word|
    +------------+
    | hello spark|
    |hello hadoop|
    +------------+
    '''
    # 1.8 F.split(): string splitting function
    df.select(F.split(df['word'],' ')).show()
    '''
    +------------------+
    |split(word,  , -1)|
    +------------------+
    |    [hello, spark]|
    |   [hello, hadoop]|
    |    [hello, flink]|
    +------------------+
    '''

    # 1.9 F.explode(): like pandas explode; expands each element of an array column into its own row
    df.select(F.explode(F.split(df['word'],' '))).show()
    '''
    +------+
    |   col|
    +------+
    | hello|
    | spark|
    | hello|
    |hadoop|
    | hello|
    | flink|
    +------+
    '''
    
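    # 1.10 df.withColumn(): compute a new column from existing ones; if the new name matches an existing column it replaces that column, otherwise a new column is appended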
    df1 = df.withColumn(colName='word',col=F.explode(F.split(df['word'],' ')))
    df1.groupBy(df1['word']).count().show()
    '''
    +------+-----+
    |  word|count|
    +------+-----+
    | hello|    3|
    | spark|    1|
    | flink|    1|
    |hadoop|    1|
    +------+-----+
    '''

    # 1.11 df.withColumnRenamed(): rename a column
    df1.groupBy(df1['word']).count().withColumnRenamed('count','cnt').show()
    '''
    +------+---+
    |  word|cnt|
    +------+---+
    | hello|  3|
    | spark|  1|
    | flink|  1|
    |hadoop|  1|
    +------+---+
    '''

    # 1.12 df.orderBy(): sort (ascending by default)
    df1.groupBy(df1['word']).count().orderBy('count').show()
    '''
    +------+-----+
    |  word|count|
    +------+-----+
    | spark|    1|
    |hadoop|    1|
    | flink|    1|
    | hello|    3|
    +------+-----+
    '''
    
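    # 1.13 F.min, F.max, F.round, F.avg; column.alias gives a Column object an alias, like AS in SQL
    # df_score holds id/name/score data with float scores (df above now holds the word lines)
    df_score = spark.createDataFrame(pd.DataFrame({'id':[1,2,3,4,5,6],
                            'name':['张三','李四','王五','张三','李四','王五'],
                            'score':[65.4,35.2,89.034,34,67,97]}))
    df_score.groupBy('name').agg(F.min('score').alias('min_'),
                      F.max('score').alias('max_'),
                      F.round(F.avg('score')).alias('round_avg')).show()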
    '''
    +----+------+----+---------+
    |name|  min_|max_|round_avg|
    +----+------+----+---------+
    |张三|  34.0|65.4|     50.0|
    |李四|  35.2|67.0|     51.0|
    |王五|89.034|97.0|     93.0|
    +----+------+----+---------+
    '''
    
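# 2. SQL: process DataFrame data with SQL

    df = spark.createDataFrame(pd_data) # rebuild df from the id/name/score data of 1.5, since df was reassigned to the word lines
    df.createTempView('tt')
    spark.sql('select name,sum(score) from tt group by name').show()
    spark.catalog.dropTempView('tt')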
    '''
    +----+----------+
    |name|sum(score)|
    +----+----------+
    |张三|        99|
    |李四|       102|
    |王五|       186|
    +----+----------+
    '''
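
For readers who want to check the word-count chain from 1.8 to 1.12 without a Spark session, the same steps can be mirrored in plain Python. This is only an analogy: the three input lines are the ones shown in the `words.txt` output above, and `collections.Counter` stands in for `groupBy('word').count()`.

```python
from collections import Counter

lines = ['hello spark', 'hello hadoop', 'hello flink']

split_rows = [line.split(' ') for line in lines]              # F.split: one list per row
exploded = [word for words in split_rows for word in words]   # F.explode: one row per list element
counts = Counter(exploded)                                    # groupBy('word').count()

# orderBy('count'): ascending by count
for word, cnt in sorted(counts.items(), key=lambda kv: kv[1]):
    print(word, cnt)
```

The row count grows from 3 to 6 at the explode step and shrinks to 4 at the grouping step, which matches the tables shown above.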

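3. Spark SQL DataFrame Data-Cleaning API

# 1. df.dropDuplicates(): remove duplicate rows; with no arguments whole rows are compared, or pass a list of columns to deduplicate on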
pd_data = pd.DataFrame({'name':['张三','李四','王五','张三','李四','王五']
                        ,'score':[65,35,89,65,67,97]})
df = spark.createDataFrame(pd_data)
df.show()
df.dropDuplicates().show()
df.dropDuplicates(['name']).show()

'''
+----+-----+
|name|score|
+----+-----+
|张三|   65|
|李四|   35|
|王五|   89|
|张三|   65|
|李四|   67|
|王五|   97|
+----+-----+

+----+-----+
|name|score|
+----+-----+
|张三|   65|
|李四|   35|
|王五|   89|
|李四|   67|
|王五|   97|
+----+-----+

+----+-----+
|name|score|
+----+-----+
|张三|   65|
|李四|   35|
|王五|   89|
+----+-----+

'''

# 2. df.dropna(): largely the same as in pandas
import numpy as np
# df.dropna(): drop rows with missing values. By default how='any' drops a row as soon as one column is null; with how='all' a row is dropped only when every column is null
pd_data = pd.DataFrame({'name':['张三','李四','王五','张三',None,None],'score':[65,35,np.nan,65,67,97]})
df = spark.createDataFrame(pd_data)
df.show()
print("dropna(how='any'):")
df.dropna().show()

print("dropna(how='all'):")
df.dropna(how='all').show()

# thresh=n: a row is kept only if at least n of the considered columns are non-null; when thresh is given, how is ignored
# subset: the columns considered when checking for nulls
print("dropna(thresh=1,subset=['name']):")
df.dropna(thresh=1,subset=['name']).show()

'''
+----+-----+
|name|score|
+----+-----+
|张三| 65.0|
|李四| 35.0|
|王五|  NaN|
|张三| 65.0|
|null| 67.0|
|null| 97.0|
+----+-----+

dropna(how='any'):
+----+-----+
|name|score|
+----+-----+
|张三| 65.0|
|李四| 35.0|
|张三| 65.0|
+----+-----+

dropna(how='all'):
+----+-----+
|name|score|
+----+-----+
|张三| 65.0|
|李四| 35.0|
|王五|  NaN|
|张三| 65.0|
|null| 67.0|
|null| 97.0|
+----+-----+

dropna(thresh=1,subset=['name']):
+----+-----+
|name|score|
+----+-----+
|张三| 65.0|
|李四| 35.0|
|王五|  NaN|
|张三| 65.0|
+----+-----+

'''
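
The interaction between `how`, `thresh` and `subset` is easier to see in a small plain-Python model that reproduces the three result tables above. It is a sketch under assumptions: `dropna_model` and `is_missing` are hypothetical helpers, rows are plain dictionaries, and both `None` and a float NaN are treated as missing, which is consistent with the outputs shown in this post.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def dropna_model(rows, how='any', thresh=None, subset=None):
    """Keep a row only if it has at least `thresh` non-missing values in `subset`."""
    cols = subset if subset is not None else list(rows[0].keys())
    if thresh is None:
        # how='any' demands every column be present; how='all' demands at least one.
        thresh = len(cols) if how == 'any' else 1
    return [r for r in rows if sum(not is_missing(r[c]) for c in cols) >= thresh]


names = ['张三', '李四', '王五', '张三', None, None]
scores = [65.0, 35.0, float('nan'), 65.0, 67.0, 97.0]
rows = [{'name': n, 'score': s} for n, s in zip(names, scores)]

print(len(dropna_model(rows)))                             # 3
print(len(dropna_model(rows, how='all')))                  # 6
print(len(dropna_model(rows, thresh=1, subset=['name'])))  # 4
```

Expressing `how` as a default for `thresh` makes it clear why `how` has no effect once `thresh` is passed explicitly.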

# 3. df.fillna(): largely the same as in pandas
import numpy as np
pd_data = pd.DataFrame({'':range(6),'name':['张三','李四','王五','张三',None,None],'score':[65,35,np.nan,65,67,97]})
df = spark.createDataFrame(pd_data)
df.show()

# df.fillna(value='value to fill with',subset=[columns to fill]): fill missing values
print("Fill all nulls:")
df.fillna(value='loss').show()
df.fillna(value='loss').printSchema()  # a string value is not applied to numeric columns, so score is left unfilled

print("Fill specific columns, each with its own value:")
df.fillna(value={'name':'无名','score':0},subset=['name','score']).show() # when value is a dict, subset is ignored

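4. Registering a Spark SQL DataFrame as a Table

# Register a DataFrame as a table
df.createTempView('tt') # register a temporary view (table)
df.createOrReplaceTempView('tt') # register a temporary view (table), replacing it if it already exists
df.createGlobalTempView('tt') # register a global view that every SparkSession in the same application can use; prefix the name with global_temp. when querying

spark.catalog.dropTempView('tt') # drop the view explicitly; otherwise it is removed automatically after spark.stop()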

5. Writing Data Out with Spark SQL

# Option 1: unified API syntax:
# df.write.mode().format().option(K,V).save()
# mode: a mode string. append: append to existing data; overwrite: replace existing data; ignore: do nothing if data already exists; error: raise an error if data already exists (the default)
# format: a format string, one of text, csv, json, parquet (the default), orc, avro, jdbc # note: text supports writing a single column only
# option: write options
# save: output path; local paths and HDFS are supported

# Option 2: call the format-specific writer directly, e.g. df.write.csv()

pd_data = pd.DataFrame({'id':[1,2,3,4,5,6],
                        'name':['张三','李四','王五','张三','李四','王五'],
                        'score':[65.4,35.2,89.034,34,67,97]})
df = spark.createDataFrame(pd_data)
df.show()

# Note that the save path is a directory, not a single file
# 1. Write CSV files
df.write.csv(path='/data/write_csv',
             mode='overwrite',sep=',',header=True,encoding='utf-8')

# 2. Write text files; only a single string column can be written
df.select(F.concat_ws(',',df['id'],df['name'],df['score']))\
     .write.mode('overwrite').text('/data/write_text')

# 3. Write JSON
df.write.json(path='/data/pyspark学习/data/write_json'
              ,mode='overwrite',encoding='utf-8')

# 4. Write parquet (Spark SQL's default format; columnar storage, which helps Spark SQL optimizations such as column pruning)
df.write.mode('overwrite').save('/data/write_parquet')

# 5. Reading from and writing to MySQL
# 5.1 Put the MySQL driver jar under pyspark/jars, then write the table
options = {'user':'xxxx','password':'xxx'}
df.write.options(**options)\
   .jdbc(url='jdbc:mysql://host_ip/database?useSSL=false&useUnicode=true'
         ,table='test_stu',mode='overwrite')

# 5.2 Read the MySQL table
spark.read.options(**options)\
.jdbc(url='jdbc:mysql://host_ip/database?useSSL=false&useUnicode=true'
      ,table='test_stu').show()

6. Defining UDFs in Spark SQL

Definition method 1:

sparksession.udf.register()
A UDF registered this way can be used in both DSL style and SQL style: the returned object is used in DSL style, and the value passed as the name argument is used in SQL style.

Syntax:
    udf = sparksession.udf.register(name,f,returnType)
Parameters:    
    name: the UDF's name, used in SQL style
    f: the Python function to register
    returnType: the declared return type of the UDF
    udf: the returned UDF object, used for DSL-style processing

Definition method 2:

pyspark.sql.functions.udf
Can be used in DSL style only

Syntax:
    udf = F.udf(f,returnType) 
Parameters:    
    f: the Python function to wrap
    returnType: the declared return type of the UDF
    udf: the returned UDF object, used for DSL-style processing

# 1. Method 1: register a UDF (returning FloatType)
pd_data = pd.DataFrame({'id':[1,2,3,4,5,6],'name':['张三','李四','王五','张三','李四','王五'],'score':[65.4,35.2,89.034,34,67,97]})
df = spark.createDataFrame(pd_data)
df.show()

def num_r_10(x):
    return x*10
num_r_10_udf = spark.udf.register(name='sql_num_r_10',f=num_r_10,returnType=FloatType())

# DSL-style usage
df.select(num_r_10_udf(df['score'])).show()

# SQL-style usage
df.createTempView('tt')
spark.sql('select id,name,sql_num_r_10(score) from tt').show()
spark.catalog.dropTempView('tt')

'''
+---+----+------+
| id|name| score|
+---+----+------+
|  1|张三|  65.4|
|  2|李四|  35.2|
|  3|王五|89.034|
|  4|张三|  34.0|
|  5|李四|  67.0|
|  6|王五|  97.0|
+---+----+------+

+-------------------+
|sql_num_r_10(score)|
+-------------------+
|              654.0|
|              352.0|
|             890.34|
|              340.0|
|              670.0|
|              970.0|
+-------------------+

+---+----+-------------------+
| id|name|sql_num_r_10(score)|
+---+----+-------------------+
|  1|张三|              654.0|
|  2|李四|              352.0|
|  3|王五|             890.34|
|  4|张三|              340.0|
|  5|李四|              670.0|
|  6|王五|              970.0|
+---+----+-------------------+
'''

# 2. Method 2: declare a UDF (returning an array type)
rdd1 = spark.sparkContext.parallelize([['hellow word'],['hellow python'],['hellow java']])
df = spark.createDataFrame(rdd1,schema='value STRING')
df.show()

# Split the string on spaces and pair every word with the string '1'
def str_split_cnt(x):
    return [(i,'1') for i in x.split(' ')]

obj_udf = F.udf(f=str_split_cnt,returnType=ArrayType(elementType=ArrayType(StringType())))

df.select(obj_udf(df['value'])).show(truncate=False)

'''
+-------------+
|        value|
+-------------+
|  hellow word|
|hellow python|
|  hellow java|
+-------------+

+--------------------------+
|str_split_cnt(value)      |
+--------------------------+
|[[hellow,1],[word,1]]     |
|[[hellow,1],[python,1]]   |
|[[hellow,1],[java,1]]     |
+--------------------------+
'''

# 3. Method 2: declare a UDF (returning a dict, declared as a StructType)
rdd1 = spark.sparkContext.parallelize([['hellow word']
                                       ,['hellow python hellow']
                                       ,['hellow java']])

df = spark.createDataFrame(rdd1,schema='value STRING')
df.show()

def str_split_cnt(x):
    return {'name':'word_cnt','cnt_num':len(x.split(' '))}

obj_udf = F.udf(f=str_split_cnt,returnType=StructType()
                .add(field='name',data_type=StringType(),nullable=True)
                .add(field='cnt_num',data_type=IntegerType(),nullable=True)
               )

df.select(obj_udf(df['value']).alias('value')).show(truncate=False)

'''
+--------------------+
|               value|
+--------------------+
|         hellow word|
|hellow python hellow|
|         hellow java|
+--------------------+

+-------------+
|value        |
+-------------+
|{word_cnt, 2}|
|{word_cnt, 3}|
|{word_cnt, 2}|
+-------------+
'''

7. Spark SQL Window Functions

Purpose:

As in ordinary SQL, a window function lets the same row show both the data before aggregation and the aggregated result; that is, the result of the aggregate function is appended as the last column of every row. (The "window" is a view opened for each row through which it can see the aggregated result.)

Kinds of window functions:

 1. Aggregate window functions:
    aggregate_function(field_name) over(partition by field_name)
 2. Ranking window functions:
    ranking_function() over([partition by field_name1] order by field_name2 [desc])
 3. Bucketing (ntile) window functions:
    ntile(n) over(partition by field_name1 order by field_name2 [desc])

# Window functions
pd_data = pd.DataFrame({'id':[1,2,3,4,5,6],'name':['张三','李四','王五','张三','李四','王五'],'score':[65.4,35.2,89.034,34,67,97]})
df = spark.createDataFrame(pd_data)
df.show()

df.createOrReplaceTempView('tt')

# Aggregate window function
spark.sql('select id,name,score,avg(score) over(partition by name)as avg_score from tt').show()

# Ranking window function
spark.sql('select id,name,score,row_number() over(partition by name order by score) as rank_score from tt').show()

# Bucketing (ntile) window function
spark.sql('select id,name,score,ntile(3) over(order by score desc)as ntile_score from tt').show()

'''
+---+----+------+
| id|name| score|
+---+----+------+
|  1|张三|  65.4|
|  2|李四|  35.2|
|  3|王五|89.034|
|  4|张三|  34.0|
|  5|李四|  67.0|
|  6|王五|  97.0|
+---+----+------+

+---+----+------+---------+
| id|name| score|avg_score|
+---+----+------+---------+
|  1|张三|  65.4|     49.7|
|  4|张三|  34.0|     49.7|
|  2|李四|  35.2|     51.1|
|  5|李四|  67.0|     51.1|
|  3|王五|89.034|   93.017|
|  6|王五|  97.0|   93.017|
+---+----+------+---------+

+---+----+------+----------+
| id|name| score|rank_score|
+---+----+------+----------+
|  4|张三|  34.0|         1|
|  1|张三|  65.4|         2|
|  2|李四|  35.2|         1|
|  5|李四|  67.0|         2|
|  3|王五|89.034|         1|
|  6|王五|  97.0|         2|
+---+----+------+----------+

+---+----+------+-----------+
| id|name| score|ntile_score|
+---+----+------+-----------+
|  6|王五|  97.0|          1|
|  3|王五|89.034|          1|
|  5|李四|  67.0|          2|
|  1|张三|  65.4|          2|
|  2|李四|  35.2|          3|
|  4|张三|  34.0|          3|
+---+----+------+-----------+
'''
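The ntile_score column above can be reproduced by hand. The following is a minimal plain-Python sketch of the standard SQL ntile rule (it is not Spark's implementation, and the helper name `ntile` is only illustrative): sorted rows are split into n buckets as evenly as possible, and when the row count is not divisible by n the leading buckets each take one extra row.

```python
def ntile(num_rows: int, n: int) -> list:
    """Return bucket numbers (1..n) for num_rows rows that are already sorted.

    Rows are split into n buckets as evenly as possible; when num_rows is not
    divisible by n, the first (num_rows % n) buckets each get one extra row.
    """
    base, extra = divmod(num_rows, n)
    buckets = []
    for bucket in range(1, n + 1):
        size = base + (1 if bucket <= extra else 0)
        buckets.extend([bucket] * size)
    return buckets


# Scores from the example table, sorted descending as in `order by score desc`.
scores = sorted([65.4, 35.2, 89.034, 34, 67, 97], reverse=True)
for score, bucket in zip(scores, ntile(len(scores), 3)):
    print(score, bucket)
```

With 6 rows and n=3 every bucket holds two rows, which matches the ntile_score column in the output above; with 7 rows the first bucket would hold three.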

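8、SparkSQL shuffle partition count

Tune the SparkSQL shuffle partition count sensibly

When a SparkSQL job triggers a shuffle, the number of partitions defaults to 200 (spark.sql.shuffle.partitions=200); in practice it should be tuned to the workload.

1. In the config file conf/spark-defaults.conf: spark.sql.shuffle.partitions 100
2. As a parameter when submitting the job: spark-submit --conf 'spark.sql.shuffle.partitions=100'
3. In code: spark = SparkSession.builder.appName('ee').config('spark.sql.shuffle.partitions','100').getOrCreate()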
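One way to see why the default of 200 can be wasteful on small data: in a hash-partitioned shuffle (for example a group-by or a join), each row is routed to a partition derived from its key, so the number of non-empty partitions can never exceed the number of distinct keys. The sketch below is plain Python with no Spark dependency; `zlib.crc32` is only a deterministic stand-in for Spark's internal hash, and the keys are hypothetical.

```python
import zlib
from collections import Counter


def shuffle_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition index (crc32 is a stand-in, not Spark's hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_sizes(keys, num_partitions):
    """Count how many rows land in each non-empty partition."""
    return Counter(shuffle_partition(k, num_partitions) for k in keys)


# Hypothetical data: 6 rows but only 3 distinct group-by keys.
keys = ["a", "b", "c", "a", "b", "c"]

sizes_200 = partition_sizes(keys, 200)
sizes_4 = partition_sizes(keys, 4)

print(len(sizes_200), "of 200 partitions hold data")
print(len(sizes_4), "of 4 partitions hold data")
```

With 200 partitions almost all of them stay empty here, yet each would still be scheduled as a task; a smaller setting fits small data better, while large data usually needs more partitions to keep each one manageable.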

9、SparkSQL execution flow


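1. Submit the SparkSQL program
2. Catalyst optimization
   - Parse the SQL into an initial AST (abstract syntax tree)
   - Annotate the AST with metadata
   - Apply predicate pushdown, column pruning, and other optimizations to the AST
   - Generate the execution plan from the final AST
   - Translate the execution plan into RDD code
3. Build the driver's execution entry point (SparkSession)
4. The DAG scheduler plans the logical tasks (splitting the job into stages and tasks)
5. The task scheduler assigns concrete tasks to executors and monitors them
6. The workers execute the tasks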

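Automatic optimization in the SparkSQL execution flow: the element type of an RDD is not fixed, whereas a SparkSQL DataFrame is always a two-dimensional table with a fixed schema, so it can be optimized in a targeted way.

Catalyst optimizer: optimizes the SQL logic before the RDD execution plan is generated.

Predicate pushdown: move the where/filter conditions on joined tables ahead of the join, so rows are filtered first and joined afterwards, which reduces the amount of data in the shuffle stage.
Column pruning: drop the columns that are not needed so the data being processed is as narrow as possible; SparkSQL's default storage format is Parquet, a columnar format that makes pruning columns cheap.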
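The effect of predicate pushdown can be illustrated without Spark. The following is a toy sketch in plain Python, not Catalyst itself: the tables, the `join` helper, and the `is_large` predicate are all hypothetical. It runs the same query under two plans and counts how many rows enter the join in each.

```python
# Hypothetical tables as lists of dicts; this only mimics the idea of pushdown.
orders = [
    {"order_id": 1, "user_id": 1, "amount": 30},
    {"order_id": 2, "user_id": 2, "amount": 250},
    {"order_id": 3, "user_id": 1, "amount": 400},
    {"order_id": 4, "user_id": 3, "amount": 15},
]
users = [
    {"user_id": 1, "name": "a"},
    {"user_id": 2, "name": "b"},
    {"user_id": 3, "name": "c"},
]


def join(left, right, key):
    """Naive inner join; returns the joined rows and how many left rows entered the join."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = [{**l, **r} for l in left for r in index.get(l[key], [])]
    return joined, len(left)


def is_large(row):
    """The where condition: keep only orders above 100."""
    return row["amount"] > 100


# Plan A: join first, filter afterwards (every order enters the join).
joined_a, rows_in_a = join(orders, users, "user_id")
result_a = [row for row in joined_a if is_large(row)]

# Plan B: push the filter below the join (only matching orders enter the join).
result_b, rows_in_b = join([o for o in orders if is_large(o)], users, "user_id")

print(rows_in_a, rows_in_b)   # rows entering the join under each plan
print(result_a == result_b)   # both plans return the same rows
```

Both plans produce the same result, but the pushed-down plan feeds fewer rows into the join; in a real cluster that is data that no longer has to be shuffled.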

Appendix: all attributes and methods of the SparkSQL DataFrame object, from the official documentation

Attribute	Official description	Notes

columns	Returns all column names as a list.
dtypes	Returns all column names and their data types as a list.
isStreaming	Returns True if this DataFrame contains one or more sources that continuously return data as it arrives.	True when the DataFrame is backed by a streaming source
na	Returns a DataFrameNaFunctions for handling missing values.	`.na` provides methods for handling missing values in a DataFrame: df.na.drop(), df.na.fill(), df.na.replace()
rdd	Returns the content as an pyspark.RDD of Row.
schema	Returns the schema of this DataFrame as a pyspark.sql.types.StructType.	Schema (column names and types) of the whole table
sparkSession	Returns Spark session that created this DataFrame.
sql_ctx
stat	Returns a DataFrameStatFunctions for statistic functions.	`.stat` exposes statistics helpers such as df.stat.corr(), df.stat.cov(), df.stat.crosstab(), df.stat.freqItems(), df.stat.approxQuantile() and df.stat.sampleBy()
storageLevel	Get the DataFrame’s current storage level.
write	Interface for saving the content of the non-streaming DataFrame out into external storage.
writeStream	Interface for saving the content of the streaming DataFrame out into external storage.


Method	Official description	Notes

agg(*exprs)	Aggregate on the entire DataFrame without groups (shorthand for df.groupBy().agg()).
alias(alias)	Returns a new DataFrame with an alias set.	Gives the DataFrame itself an alias (useful in self-joins); it does not rename columns
approxQuantile(col, probabilities, relativeError)	Calculates the approximate quantiles of numerical columns of a DataFrame.
cache()	Persists the DataFrame with the default storage level (MEMORY_AND_DISK).
[checkpoint([eager])](https://spark.apache.org/docs/3.3.0/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.checkpoint.html#pyspark.sql.DataFrame.checkpoint)	Returns a checkpointed version of this DataFrame.
coalesce(numPartitions)	Returns a new DataFrame that has exactly numPartitions partitions.	Changes the partition count; can only reduce it
colRegex(colName)	Selects column based on the column name specified as a regex and returns it as Column.
collect()	Returns all the records as a list of Row.
corr(col1, col2[, method])	Calculates the correlation of two columns of a DataFrame as a double value.	Correlation between two columns
count()	Returns the number of rows in this DataFrame.
cov(col1, col2)	Calculate the sample covariance for the given columns, specified by their names, as a double value.	Covariance
createGlobalTempView(name)	Creates a global temporary view with this DataFrame.
createOrReplaceGlobalTempView(name)	Creates or replaces a global temporary view using the given name.
createOrReplaceTempView(name)	Creates or replaces a local temporary view with this DataFrame.
createTempView(name)	Creates a local temporary view with this DataFrame.
crossJoin(other)	Returns the cartesian product with another DataFrame.
crosstab(col1, col2)	Computes a pair-wise frequency table of the given columns.	Cross-tabulation (contingency table)
cube(*cols)	Create a multi-dimensional cube for the current DataFrame using the specified columns, so we can run aggregations on them.	Aggregates over every combination of the given columns (like SQL GROUP BY CUBE); not the same as a pivot
describe(*cols)	Computes basic statistics for numeric and string columns.	Basic statistics: count, mean, stddev, min, max
distinct()	Returns a new DataFrame containing the distinct rows in this DataFrame.	Deduplicate rows
drop(*cols)	Returns a new DataFrame that drops the specified column.	Drop columns
dropDuplicates([subset])	Return a new DataFrame with duplicate rows removed, optionally only considering certain columns.
drop_duplicates([subset])	drop_duplicates() is an alias for dropDuplicates().
dropna([how, thresh, subset])	Returns a new DataFrame omitting rows with null values.	Drop rows with nulls
exceptAll(other)	Return a new DataFrame containing rows in this DataFrame but not in another DataFrame while preserving duplicates.
explain([extended, mode])	Prints the (logical and physical) plans to the console for debugging purpose.
fillna(value[, subset])	Replace null values, alias for na.fill().	Fill nulls
filter(condition)	Filters rows using the given condition.	Filter by condition
first()	Returns the first row as a Row.	First row
foreach(f)	Applies the f function to all Row of this DataFrame.
foreachPartition(f)	Applies the f function to each partition of this DataFrame.
freqItems(cols[, support])	Finding frequent items for columns, possibly with false positives.
groupBy(*cols)	Groups the DataFrame using the specified columns, so we can run aggregation on them.
groupby(*cols)	groupby() is an alias for groupBy().	Group by
head([n])	Returns the first n rows.	First n rows
hint(name, *parameters)	Specifies some hint on the current DataFrame.
inputFiles()	Returns a best-effort snapshot of the files that compose this DataFrame.	Best-effort list of the input files backing this DataFrame
intersect(other)	Return a new DataFrame containing rows only in both this DataFrame and another DataFrame.	Intersection
intersectAll(other)	Return a new DataFrame containing rows in both this DataFrame and another DataFrame while preserving duplicates.
isEmpty()	Returns True if this DataFrame is empty.	Whether the DataFrame is empty
isLocal()	Returns True if the collect() and take() methods can be run locally (without any Spark executors).	Whether collect()/take() can run on the driver alone, without executors
join(other[, on, how])	Joins with another DataFrame, using the given join expression.	Join tables
limit(num)	Limits the result count to the number specified.
localCheckpoint([eager])	Returns a locally checkpointed version of this DataFrame.
mapInArrow(func, schema)	Maps an iterator of batches in the current DataFrame using a Python native function that takes and outputs a PyArrow’s RecordBatch, and returns the result as a DataFrame.
mapInPandas(func, schema)	Maps an iterator of batches in the current DataFrame using a Python native function that takes and outputs a pandas DataFrame, and returns the result as a DataFrame.
observe(observation, *exprs)	Observe (named) metrics through an Observation instance.
orderBy(*cols, **kwargs)	Returns a new DataFrame sorted by the specified column(s).	Sort
pandas_api([index_col])	Converts the existing DataFrame into a pandas-on-Spark DataFrame.
persist([storageLevel])	Sets the storage level to persist the contents of the DataFrame across operations after the first time it is computed.
printSchema()	Prints out the schema in the tree format.
randomSplit(weights[, seed])	Randomly splits this DataFrame with the provided weights.
registerTempTable(name)	Registers this DataFrame as a temporary table using the given name.	Deprecated since 2.0; use createOrReplaceTempView() instead
repartition(numPartitions, *cols)	Returns a new DataFrame partitioned by the given partitioning expressions.	Can also increase the partition count
repartitionByRange(numPartitions, *cols)	Returns a new DataFrame partitioned by the given partitioning expressions.
replace(to_replace[, value, subset])	Returns a new DataFrame replacing a value with another value.	Value replacement, as in pandas
rollup(*cols)	Create a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them.
sameSemantics(other)	Returns True when the logical query plans inside both DataFrames are equal and therefore return same results.
sample([withReplacement, fraction, seed])	Returns a sampled subset of this DataFrame.
sampleBy(col, fractions[, seed])	Returns a stratified sample without replacement based on the fraction given on each stratum.	Stratified sampling keyed on a column's values
select(*cols)	Projects a set of expressions and returns a new DataFrame.	Select columns by name or expression
selectExpr(*expr)	Projects a set of SQL expressions and returns a new DataFrame.	Select using SQL expressions
semanticHash()	Returns a hash code of the logical query plan against this DataFrame.
show([n, truncate, vertical])	Prints the first n rows to the console.
sort(*cols, **kwargs)	Returns a new DataFrame sorted by the specified column(s).	Sort
sortWithinPartitions(*cols, **kwargs)	Returns a new DataFrame with each partition sorted by the specified column(s).	Sort within each partition
subtract(other)	Return a new DataFrame containing rows in this DataFrame but not in another DataFrame.	Set difference
summary(*statistics)	Computes specified statistics for numeric and string columns.
tail(num)	Returns the last num rows as a list of Row.
take(num)	Returns the first num rows as a list of Row.
toDF(*cols)	Returns a new DataFrame that with new specified column names
toJSON([use_unicode])	Converts a DataFrame into a RDD of string.
toLocalIterator([prefetchPartitions])	Returns an iterator that contains all of the rows in this DataFrame.
toPandas()	Returns the contents of this DataFrame as Pandas pandas.DataFrame.
to_koalas([index_col])
to_pandas_on_spark([index_col])
transform(func, *args, **kwargs)	Returns a new DataFrame.
union(other)	Return a new DataFrame containing union of rows in this and another DataFrame.	Concatenates the rows of two DataFrames by column position without deduplicating (like SQL UNION ALL); follow with distinct() to deduplicate
unionAll(other)	Return a new DataFrame containing union of rows in this and another DataFrame.	Same as union(); no deduplication
unionByName(other[, allowMissingColumns])	Returns a new DataFrame containing union of rows in this and another DataFrame.	Resolves columns by name rather than by position
unpersist([blocking])	Marks the DataFrame as non-persistent, and remove all blocks for it from memory and disk.	Clear the cache
where(condition)	where() is an alias for filter().	Same as filter
withColumn(colName, col)	Returns a new DataFrame by adding a column or replacing the existing column that has the same name.	Adds or replaces a column (for example with an expression built from pyspark.sql.functions)
withColumnRenamed(existing, new)	Returns a new DataFrame by renaming an existing column.	Rename a column
withColumns(*colsMap)	Returns a new DataFrame by adding multiple columns or replacing the existing columns that has the same names.	Add or replace multiple columns
withMetadata(columnName, metadata)	Returns a new DataFrame by updating an existing column with metadata.
withWatermark(eventTime, delayThreshold)	Defines an event time watermark for this DataFrame.
writeTo(table)	Create a write configuration builder for v2 sources.
