Spark Basics 3

A generic way to read files

Author: https://blog.csdn.net/u012925804/article/details/113641803

  • When reading, use spark.read.load; a format can be specified (json, parquet, jdbc, orc, libsvm, csv, text).

  • When writing, a format can also be specified. The Python and Scala APIs differ slightly.

PySpark:

df = spark.read.load("examples/src/main/resources/people.json", format="json")
df.select("name", "age").write.save("namesAndAges.parquet", format="parquet")

Scala:

val peopleDF = spark.read.format("json").load("examples/src/main/resources/people.json")
peopleDF.select("name", "age").write.format("parquet").save("namesAndAges.parquet")

Example:

Reading CSV:

Python:

df = spark.read.load("examples/src/main/resources/people.csv",
                     format="csv", sep=":", inferSchema="true", header="true")

Scala: parameters are configured via option.

val peopleDFCsv = spark.read.format("csv")
  .option("sep", ";")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("examples/src/main/resources/people.csv")

Running SQL directly on a file:

df = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")

Save modes

The mode can be overwrite, append, and so on:

Save operations can optionally take a SaveMode, which specifies how to handle existing data if present. It is important to realize that these save modes do not utilize any locking and are not atomic. Additionally, when performing an Overwrite, the data will be deleted before writing out the new data.

Scala/Java | Any Language | Meaning
SaveMode.ErrorIfExists (default) | "error" or "errorifexists" (default) | When saving a DataFrame to a data source, if data already exists, an exception is expected to be thrown.
SaveMode.Append | "append" | When saving a DataFrame to a data source, if data/table already exists, contents of the DataFrame are expected to be appended to existing data.
SaveMode.Overwrite | "overwrite" | Overwrite mode means that when saving a DataFrame to a data source, if data/table already exists, existing data is expected to be overwritten by the contents of the DataFrame.
SaveMode.Ignore | "ignore" | Ignore mode means that when saving a DataFrame to a data source, if data already exists, the save operation is expected not to save the contents of the DataFrame and not to change the existing data. This is similar to a CREATE TABLE IF NOT EXISTS in SQL.
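
In PySpark the mode is set on the writer with mode(); a minimal sketch (the output paths here are made up for illustration):

df = spark.read.load("examples/src/main/resources/people.json", format="json")

# "overwrite" deletes any existing data at the path before writing the new data
df.write.mode("overwrite").format("parquet").save("people_overwrite.parquet")

# "append" adds the DataFrame's rows to whatever is already there
df.write.mode("append").format("parquet").save("people_append.parquet")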

Bucketing, Sorting and Partitioning

For file-based data sources, it is also possible to bucket and sort or partition the output. Bucketing and sorting are applicable only to persistent tables:

df.write.bucketBy(42, "name").sortBy("age").saveAsTable("people_bucketed")

Find full example code at “examples/src/main/python/sql/datasource.py” in the Spark repo.

Partitioning, in contrast, can be used with both save and saveAsTable when using the Dataset APIs.

df.write.partitionBy("favorite_color").format("parquet").save("namesPartByColor.parquet")

Find full example code at “examples/src/main/python/sql/datasource.py” in the Spark repo.

It is possible to use both partitioning and bucketing for a single table:

df = spark.read.parquet("examples/src/main/resources/users.parquet")
(df
    .write
    .partitionBy("favorite_color")
    .bucketBy(42, "name")
    .saveAsTable("people_partitioned_bucketed"))

PySpark Usage Guide for Pandas with Apache Arrow

http://spark.apache.org/docs/2.4.3/sql-pyspark-pandas-with-arrow.html
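
In short, that guide shows how to speed up conversions between Spark and pandas with Arrow. A minimal sketch, assuming Spark 2.4 with pyarrow installed (the sample data is made up):

import pandas as pd

# Enable Arrow-based columnar data transfers (Spark 2.4.x config key)
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# pandas -> Spark DataFrame and back; both directions benefit from Arrow
pdf = pd.DataFrame({"name": ["Alice", "Bob"], "age": [30, 25]})
sdf = spark.createDataFrame(pdf)
result_pdf = sdf.toPandas()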

What is bucketing, and how does it differ from partitioning?

Partitioning: when Hive queries data it normally scans the entire table, which wastes a lot of time. Often we only care about a subset of the data, for example the rows matched by a WHERE clause, so a full table scan hurts performance badly. This is where partitioning comes in: the data is divided by category, so a query can be restricted to the relevant partitions instead of scanning the whole table.

Each partition corresponds to one directory.
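
For instance, the partitionBy("favorite_color") write above produces one subdirectory per distinct value; a sketch of the resulting layout (file names are illustrative):

namesPartByColor.parquet/
    favorite_color=red/
        part-00000.snappy.parquet
    favorite_color=green/
        part-00000.snappy.parquet

A query filtering on favorite_color then only needs to read the matching directories.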

Bucketing: not every dataset can be partitioned sensibly, in particular because of the difficulty, mentioned earlier, of choosing an appropriate partition granularity. Every table or partition can be further subdivided into buckets, which divide the data at a finer grain. By default Hive hashes each value of a chosen column and takes the hash code modulo the number of buckets to decide which bucket a record goes into.
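
A toy sketch of that assignment rule in plain Python (illustration only: Hive and Spark use their own hash functions, not Python's built-in hash):

def bucket_for(value, num_buckets=42):
    # hash the bucketing column's value, then take it modulo the bucket count
    return hash(value) % num_buckets

# records with the same name always land in the same one of the 42 buckets
bucket_for("Alice")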

Source: https://www.jianshu.com/p/004462037557
