01-Read&Write

Latest recommended article published 2023-02-23 08:57:11

wangyanglongcc

Article link: https://blog.csdn.net/qq_33246702/article/details/124327308

Part of the column: Azure Databricks in Action

Reader

Read from CSV files

spark.read.csv can also be used to read CSV files, and it is the more common choice.

Read from CSV with DataFrameReader’s csv method and the following options:
Tab separator, use first line as header, infer schema

file_csv = "/mnt/training/ecommerce/users/users-500k.csv"
df = (spark.read
  .option("sep", "\t") # use tab as the separator
  .option("header", True) # the first line is the header
  .option("inferSchema", True) # infer the schema (column types) automatically
  .csv(file_csv) # file path
  )
df.printSchema() # inspect the resulting schema
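
The same read can be written in two other equivalent forms: the generic format("csv") ... load(...) chain, and keyword arguments passed directly to csv. The sketch below is illustrative only. The helper names read_tsv_generic and read_tsv_keywords are made up for this example, and spark is assumed to be an existing SparkSession, as provided in a Databricks notebook.

```python
def read_tsv_generic(spark, path):
    """Read a tab-separated file via the generic format(...).load(...) API."""
    return (spark.read
            .format("csv")
            .option("sep", "\t")
            .option("header", True)
            .option("inferSchema", True)
            .load(path))


def read_tsv_keywords(spark, path):
    """Same read, passing the options as keyword arguments to csv()."""
    return spark.read.csv(path, sep="\t", header=True, inferSchema=True)
```

Either helper could be called as read_tsv_keywords(spark, file_csv). Which form to use is mostly a matter of taste; the keyword form is shorter, while the option chain makes it easy to add or remove options one at a time.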

Sometimes the inferred schema is wrong. In that case we can specify the schema ourselves. There are two ways to do so: a StructType or a DDL schema string.

StructType

from pyspark.sql.types import DoubleType, IntegerType, LongType, StringType, StructField, StructType, TimestampType
userDefinedSchema = StructType([
  StructField("user_id", StringType(), True),
  StructField("user_first_touch_timestamp", LongType(), True),
  StructField("email", StringType(), True)
])

df = (spark.read
  .option("sep", "\t") # use tab as the separator
  .option("header", True) # the first line is the header
  .schema(userDefinedSchema) # use the schema defined above
  .csv(file_csv) # file path
  )

DDLSchema

DDLSchema = "user_id string, user_first_touch_timestamp long, email string"

df = (spark.read
  .option("sep", "\t")
  .option("header", True)
  .schema(DDLSchema)
  .csv(file_csv)
  )

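There are two ways to obtain the schema of an existing DataFrame. One is to read df.schema directly, which returns a StructType (a collection of StructField objects). The other is to generate a DDL schema string; here this is done in Scala, because PySpark had no corresponding method at the time of writing.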

spark.read.parquet("/mnt/training/ecommerce/events/events.parquet").schema.toDDL
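
Where a Python-side toDDL is not available, one possible workaround is to build the DDL string from the JSON form of the schema, which df.schema.jsonValue() returns as nested dicts. The sketch below is not part of the original article. The function names are hypothetical, the example dict is a hand-written stand-in for a real jsonValue() result, and nullability, comments, and column names that need backtick quoting are ignored.

```python
def type_to_ddl(t):
    """Convert one data type, in the JSON form produced by jsonValue(), to DDL."""
    if isinstance(t, str):  # atomic types are plain strings, e.g. "long"
        return t
    if t["type"] == "struct":
        inner = ", ".join(f"{f['name']}: {type_to_ddl(f['type'])}" for f in t["fields"])
        return f"struct<{inner}>"
    if t["type"] == "array":
        return f"array<{type_to_ddl(t['elementType'])}>"
    if t["type"] == "map":
        return f"map<{type_to_ddl(t['keyType'])}, {type_to_ddl(t['valueType'])}>"
    raise ValueError(f"unsupported type: {t['type']}")


def schema_to_ddl(schema_json):
    """Build a DDL string ("name type, ...") from a top-level struct in JSON form."""
    return ", ".join(f"{f['name']} {type_to_ddl(f['type'])}" for f in schema_json["fields"])


# Hand-written stand-in for df.schema.jsonValue() on a small DataFrame.
example = {
    "type": "struct",
    "fields": [
        {"name": "user_id", "type": "string", "nullable": True, "metadata": {}},
        {"name": "geo", "type": {"type": "struct", "fields": [
            {"name": "city", "type": "string", "nullable": True, "metadata": {}},
        ]}, "nullable": True, "metadata": {}},
    ],
}
print(schema_to_ddl(example))  # user_id string, geo struct<city: string>
```

With a real DataFrame the call would be schema_to_ddl(df.schema.jsonValue()), and the resulting string can be passed to .schema(...) in the same way as the DDLSchema example above.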

Read from JSON files

spark.read.json can also be used to read JSON files, and it is the more common choice.

file_json = "/mnt/training/ecommerce/events/events-500k.json"

df = (spark.read
  .option("inferSchema", True) # optional: the JSON reader infers the schema by default
  .json(file_json)
  )

df.printSchema()

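Here we can see that JSON data is sometimes nested. When specifying the schema explicitly, the same nesting has to be declared. In the example above, the nested fields are: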

ecommerce
  purchaseRevenue
  total_item_quantity
  unique_items
geo
  city
  state
items (an array of structs)
  coupon
  item_id
  item_name
  item_revenue_in_usd
  price_in_usd
  quantity

from pyspark.sql.types import ArrayType # needed for the items array below
userDefinedSchema = StructType([
  StructField("device", StringType(), True),
  StructField("ecommerce", StructType([
                          StructField("purchaseRevenue", DoubleType(), True),
                          StructField("total_item_quantity", LongType(), True),
                          StructField("unique_items", LongType(), True)
                          ])
              , True),
  StructField("event_name", StringType(), True),
  StructField("event_previous_timestamp", LongType(), True),
  StructField("event_timestamp", LongType(), True),
  StructField("geo", StructType([
                      StructField("city", StringType(), True),
                      StructField("state", StringType(), True)
                    ])
              , True),
  StructField("items", ArrayType(
                      StructType([
                        StructField("coupon", StringType(), True),
                        StructField("item_id", StringType(), True),
                        StructField("item_name", StringType(), True),
                        StructField("item_revenue_in_usd", DoubleType(), True),
                        StructField("price_in_usd", DoubleType(), True),
                        StructField("quantity", LongType(), True)
                      ])
                    )
              , True),
  StructField("traffic_source", StringType(), True),
  StructField("user_first_touch_timestamp", LongType(), True),
  StructField("user_id", StringType(), True)
])

Of course, a DDL schema string can be used here as well.
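
As a sketch of what that might look like, the DDL string below mirrors the nested StructType field by field, using struct<...> and array<...> for the nested parts. The names events_ddl and read_events are illustrative, not from the original article, and spark is assumed to be an existing SparkSession.

```python
# DDL counterpart of the nested StructType; "long" is the same type alias
# used in the flat DDLSchema example for the CSV file ("bigint" also works).
events_ddl = (
    "device string, "
    "ecommerce struct<purchaseRevenue: double, total_item_quantity: long, unique_items: long>, "
    "event_name string, "
    "event_previous_timestamp long, "
    "event_timestamp long, "
    "geo struct<city: string, state: string>, "
    "items array<struct<coupon: string, item_id: string, item_name: string, "
    "item_revenue_in_usd: double, price_in_usd: double, quantity: long>>, "
    "traffic_source string, "
    "user_first_touch_timestamp long, "
    "user_id string"
)


def read_events(spark, path):
    """Read a JSON file using the DDL schema instead of schema inference."""
    return spark.read.schema(events_ddl).json(path)
```

Calling read_events(spark, file_json) would then skip schema inference entirely, which avoids an extra pass over the data.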

Writer

Write DataFrame to file

We usually write data out in Parquet format, although CSV is supported too.

Write as Parquet

outputfile_path = f'/mnt/dbwarehouse/files/{filename}.parquet' # filename is assumed to be defined earlier

(df.write
  .option("compression", "snappy") # compression codec
  .mode("overwrite") # save mode
  .parquet(outputfile_path) # output path
  )

There is also a shorter form:

df.write.parquet(outputfile_path,mode='overwrite')

Write as CSV

outputfile_path = f'/mnt/dbwarehouse/files/{filename}.csv'

(df.write
  .mode("overwrite") # save mode
  .csv(outputfile_path) # output path
  )

There is also a shorter form:

df.write.csv(outputfile_path,mode='overwrite')
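
One detail worth keeping in mind: on write, the CSV header option defaults to False and the separator to a comma, so a file written with the calls above will not round-trip through the tab-separated, header-first-line reader options from the Reader section. The sketch below sets both explicitly and checks the save mode. The helper name write_tsv and the VALID_MODES set are illustrative, and the mode check only covers the lowercase spellings.

```python
VALID_MODES = {"append", "overwrite", "ignore", "error", "errorifexists"}


def write_tsv(df, path, mode="overwrite"):
    """Write df as tab-separated text with a header row."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown save mode: {mode!r}")
    # header defaults to False and sep to "," on write, so set both explicitly.
    df.write.csv(path, mode=mode, sep="\t", header=True)
```

Note that Spark treats the output path as a directory containing one or more part files, even when the path ends in .csv or .parquet.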

Write DataFrame to table

Use saveAsTable to save a DataFrame as a table.

tb_name = 'users'
df.write.mode("overwrite").saveAsTable(tb_name)

Write DataFrame to delta table

Set the format to delta and call save to store a DataFrame as a Delta table.

delta_tb_path = '/mnt/dbwarehouse/delta/users'
df.write.format('delta').mode('overwrite').save(delta_tb_path)
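
To check the result, the Delta table can be read back from the same path with the matching format. A minimal sketch, assuming spark is an existing SparkSession on a cluster with Delta Lake available (as on Databricks); read_delta is a hypothetical helper name.

```python
def read_delta(spark, path):
    """Load a Delta table from its storage path as a DataFrame."""
    return spark.read.format("delta").load(path)
```

For example, read_delta(spark, delta_tb_path).count() should match df.count() right after an overwrite.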

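This article covers only the most basic ways of reading and writing. In real work, when data has to be written, we usually create the table first, specifying the column types, the storage format (such as parquet, textfile, or delta), and the storage location (usually set when the database is created). We then register the DataFrame as a temporary view with createOrReplaceTempView and insert into the table from that view. For example: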

create database if not exists demo location "/mnt/dbwarehouse/demo";
create table if not exists demo.first_table(
id int,
value string
) using delta

df.createOrReplaceTempView('df')
spark.sql('insert overwrite demo.first_table select * from df')
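
An alternative that skips the temporary view is DataFrameWriter.insertInto, which writes into an existing table directly. This is a sketch rather than the article's method; overwrite_table is a hypothetical helper name, and the target table is assumed to exist already.

```python
def overwrite_table(df, table_name):
    """Overwrite an existing table with df, matching columns by position."""
    df.write.insertInto(table_name, overwrite=True)
```

A call such as overwrite_table(df, 'demo.first_table') behaves like the insert overwrite ... select * statement in that columns are matched by position, not by name, so the DataFrame's column order has to match the table definition.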
