Using PySpark

山猪打不过家猪

Last modified on 2024-06-09 12:29:15


First published on 2024-06-07 15:14:11

Article link: https://blog.csdn.net/weixin_42067536/article/details/139523714


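This article walks through using PySpark to mount and read files: reading files from Azure Blob storage, selecting columns, applying filter conditions, column operations such as adding, dropping, and renaming, flattening JSON and lists, and the join and union methods. It also covers the when method and practical use cases for pivot/unpivot.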


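I. Mounting files and file operations

1. Mounting files from Azure Blob storage (common)

source: the address of the Blob container: source = 'wasbs://<container name>@<storage account name>.blob.core.windows.net'
extra_configs: the storage account configuration: extra_configs = {'fs.azure.account.key.<storage account name>.blob.core.windows.net': '<storage account access key>'}
mount_point: the name given to the mount point: mount_point = '/mnt/<custom name>'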

dbutils.fs.mount(
    source='wasbs://input@dl108lg.blob.core.windows.net',
    mount_point='/mnt/blob108_input',
    extra_configs={
        'fs.azure.account.key.dl108lg.blob.core.windows.net': 'fuDFNS1ziD9Lw4aeH/N6gw7+4'
    }
)
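The three arguments above all follow a fixed pattern, so they can be assembled from the container name, the storage account name, and the access key. The sketch below is only an illustration: `build_mount_args` is a hypothetical helper, not part of `dbutils`, and the access key is left as a placeholder.

```python
def build_mount_args(container, account, access_key, mount_name):
    """Assemble keyword arguments for dbutils.fs.mount (hypothetical helper)."""
    host = f"{account}.blob.core.windows.net"
    return {
        # wasbs://<container name>@<storage account name>.blob.core.windows.net
        "source": f"wasbs://{container}@{host}",
        # /mnt/<custom name>
        "mount_point": f"/mnt/{mount_name}",
        # the config key embeds the same storage account host name
        "extra_configs": {f"fs.azure.account.key.{host}": access_key},
    }


args = build_mount_args("input", "dl108lg", "<access key>", "blob108_input")
print(args["source"])       # wasbs://input@dl108lg.blob.core.windows.net
print(args["mount_point"])  # /mnt/blob108_input

# In a Databricks notebook, where dbutils is available:
# dbutils.fs.mount(**args)
```

Building the arguments in one place keeps the account name consistent between `source` and the `extra_configs` key, which is an easy thing to get wrong when typing them by hand.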

2. File operations

2.1 List all current mounts

dbutils.fs.mounts()
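One common use of the mount list is to check whether a mount point already exists before mounting again. The following is a minimal sketch, assuming each record returned by `dbutils.fs.mounts()` exposes a `mountPoint` attribute; `is_mounted` is a hypothetical helper and the `MountInfo` namedtuple here is only a local stand-in so the example runs outside Databricks.

```python
from collections import namedtuple


def is_mounted(mounts, mount_point):
    """Return True if mount_point appears among the given mount records."""
    return any(m.mountPoint == mount_point for m in mounts)


# Local stand-in for the records dbutils.fs.mounts() returns (assumption)
MountInfo = namedtuple("MountInfo", ["mountPoint", "source"])
sample = [
    MountInfo("/mnt/blob108_input", "wasbs://input@dl108lg.blob.core.windows.net"),
]

print(is_mounted(sample, "/mnt/blob108_input"))  # True
print(is_mounted(sample, "/mnt/other"))          # False

# In a Databricks notebook:
# if not is_mounted(dbutils.fs.mounts(), "/mnt/blob108_input"):
#     dbutils.fs.mount(...)
```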

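2.2 Reading csv/parquet/delta

path = '/mnt/blob108_bronze/Sales/emp.csv'
# header=True treats the first row as column names
df = spark.read.format('csv').load(path, header=True)

Read parquet

path = '/mnt/blob108_bronze/Sales/emp.parquet'
# Parquet stores column names itself, so no header option is needed
df = spark.read.format('parquet').load(path)

Read delta

# A Delta table is a directory; point the path at the table's root folder
path = '/mnt/blob108_bronze/Sales/emp.parquet'
df = spark.read.format('delta').load(path)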
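The three reads differ only in the format name and in which options make sense: `header` is a CSV reader option, while Parquet and Delta carry column names in their own metadata. A small sketch of a shared loader, assuming an existing `SparkSession` named `spark`; `reader_options` and `read_table` are hypothetical helper names introduced for illustration.

```python
def reader_options(fmt):
    """Reader options to pass for a given format (hypothetical helper)."""
    # Only CSV needs to be told that the first row holds column names.
    return {"header": True} if fmt == "csv" else {}


def read_table(spark, path, fmt):
    """Load path as a DataFrame using the given format name."""
    return spark.read.format(fmt).options(**reader_options(fmt)).load(path)


print(reader_options("csv"))      # {'header': True}
print(reader_options("parquet"))  # {}

# With a SparkSession available:
# df = read_table(spark, '/mnt/blob108_bronze/Sales/emp.csv', 'csv')
```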

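2.3 Supplying a custom schema when reading

Define the schema explicitly instead of relying on inferSchema. With neither, the CSV reader loads every column as a string.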

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# StructType takes a list of StructField(name, type, nullable)
schema_defined = StructType([
    StructField('Year', IntegerType(), True),
    StructField('Name', StringType(), True),
    StructField('Country', StringType(), True)
])
csv_path = '/mnt/blob108_bronze/Sales/employee.csv'

df = spark.read.format("csv").schema(schema_defined).option("header", True).option("sep", ",").load(csv_path)

df.printSchema()
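`DataFrameReader.schema` also accepts a DDL-formatted string, which can be shorter than building a `StructType` by hand. The sketch below assembles such a string for the same three columns; `to_ddl` is a hypothetical helper, and the read itself is left as a comment because it needs a running `SparkSession`.

```python
def to_ddl(columns):
    """Join (name, SQL type) pairs into a DDL schema string (hypothetical helper)."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns)


ddl = to_ddl([("Year", "INT"), ("Name", "STRING"), ("Country", "STRING")])
print(ddl)  # Year INT, Name STRING, Country STRING

# With a SparkSession available:
# df = spark.read.format("csv").schema(ddl).option("header", True).load(csv_path)
# df.printSchema()
```

The `StructType` form is still the better fit when fields need metadata or are built programmatically; the DDL string is mainly a convenience for flat schemas like this one.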