Spark Basics 3

A generic way to read files

Author: https://blog.csdn.net/u012925804/article/details/113641803

  • When reading, use spark.read.load; a format can be specified (json, parquet, jdbc, orc, libsvm, csv, text).

  • When writing, a format can also be specified. The Python and Scala APIs differ slightly.

PySpark:

df = spark.read.load("examples/src/main/resources/people.json", format="json")
df.select("name", "age").write.save("namesAndAges.parquet", format="parquet")

Scala:

val peopleDF = spark.read.format("json").load("examples/src/main/resources/people.json")
peopleDF.select("name", "age").write.format("parquet").save("namesAndAges.parquet")

Example:

Reading CSV:

Python:

df = spark.read.load("examples/src/main/resources/people.csv",
                     format="csv", sep=":", inferSchema="true", header="true")

Scala: parameters are configured via option.

val peopleDFCsv = spark.read.format("csv")
  .option("sep", ";")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("examples/src/main/resources/people.csv")

Running SQL directly on a file:

df = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")

Save modes

The mode can be overwrite, append, and so on:

Save operations can optionally take a SaveMode, which specifies how to handle existing data if present. It is important to realize that these save modes do not utilize any locking and are not atomic. Additionally, when performing an Overwrite, the data will be deleted before writing out the new data.

Scala/Java | Any Language | Meaning
SaveMode.ErrorIfExists (default) | "error" or "errorifexists" (default) | When saving a DataFrame to a data source, if data already exists, an exception is expected to be thrown.
SaveMode.Append | "append" | When saving a DataFrame to a data source, if data/table already exists, contents of the DataFrame are expected to be appended to existing data.
SaveMode.Overwrite | "overwrite" | Overwrite mode means that when saving a DataFrame to a data source, if data/table already exists, existing data is expected to be overwritten by the contents of the DataFrame.
SaveMode.Ignore | "ignore" | Ignore mode means that when saving a DataFrame to a data source, if data already exists, the save operation is expected not to save the contents of the DataFrame and not to change the existing data. This is similar to a CREATE TABLE IF NOT EXISTS in SQL.
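
In PySpark the mode is set on the writer with mode(); a minimal sketch (the output paths here are made up for illustration):

df = spark.read.load("examples/src/main/resources/people.json", format="json")

# "overwrite" deletes any existing data at the path before writing the new data
df.write.mode("overwrite").format("parquet").save("people_overwrite.parquet")

# "append" adds the DataFrame's rows to whatever is already there
df.write.mode("append").format("parquet").save("people_append.parquet")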

Bucketing, Sorting and Partitioning

For file-based data sources, it is also possible to bucket and sort or partition the output. Bucketing and sorting are applicable only to persistent tables:

df.write.bucketBy(42, "name").sortBy("age").saveAsTable("people_bucketed")

Find full example code at “examples/src/main/python/sql/datasource.py” in the Spark repo.

Partitioning, in contrast, can be used with both save and saveAsTable when using the Dataset APIs.

df.write.partitionBy("favorite_color").format("parquet").save("namesPartByColor.parquet")

Find full example code at “examples/src/main/python/sql/datasource.py” in the Spark repo.

It is possible to use both partitioning and bucketing for a single table:

df = spark.read.parquet("examples/src/main/resources/users.parquet")
(df
    .write
    .partitionBy("favorite_color")
    .bucketBy(42, "name")
    .saveAsTable("people_partitioned_bucketed"))

PySpark Usage Guide for Pandas with Apache Arrow

http://spark.apache.org/docs/2.4.3/sql-pyspark-pandas-with-arrow.html
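
In short, that guide shows how to speed up conversions between Spark and pandas with Arrow. A minimal sketch, assuming Spark 2.4 with pyarrow installed (the sample data is made up):

import pandas as pd

# Enable Arrow-based columnar data transfers (Spark 2.4.x config key)
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# pandas -> Spark DataFrame and back; both directions benefit from Arrow
pdf = pd.DataFrame({"name": ["Alice", "Bob"], "age": [30, 25]})
sdf = spark.createDataFrame(pdf)
result_pdf = sdf.toPandas()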

What is bucketing, and how does it differ from partitioning?

Partitioning: when Hive queries data it normally scans the entire table, which wastes a lot of time. Often we only care about a subset of the data, for example the rows matched by a WHERE clause, so a full table scan hurts performance badly. This is where partitioning comes in: the data is divided by category, so a query can be restricted to the relevant partitions instead of scanning the whole table.

Each partition corresponds to one directory.
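
For instance, the partitionBy("favorite_color") write above produces one subdirectory per distinct value; a sketch of the resulting layout (file names are illustrative):

namesPartByColor.parquet/
    favorite_color=red/
        part-00000.snappy.parquet
    favorite_color=green/
        part-00000.snappy.parquet

A query filtering on favorite_color then only needs to read the matching directories.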

Bucketing: not every dataset can be partitioned sensibly, in particular because of the difficulty, mentioned earlier, of choosing an appropriate partition granularity. Every table or partition can be further subdivided into buckets, which divide the data at a finer grain. By default Hive hashes each value of a chosen column and takes the hash code modulo the number of buckets to decide which bucket a record goes into.
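
A toy sketch of that assignment rule in plain Python (illustration only: Hive and Spark use their own hash functions, not Python's built-in hash):

def bucket_for(value, num_buckets=42):
    # hash the bucketing column's value, then take it modulo the bucket count
    return hash(value) % num_buckets

# records with the same name always land in the same one of the 42 buckets
bucket_for("Alice")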

Source: https://www.jianshu.com/p/004462037557
