In day-to-day work, clients very often deliver data as Excel files, and once they land on your desk you have to deal with them.
My usual approach is to read the file with Pandas and save it as Parquet:
# Read a single sheet
import pandas as pd

df = pd.read_excel("excel1.xlsx")
df.to_parquet("excel_etl/excel1.parquet")
# If one workbook contains multiple sheets
import pandas as pd

xl = pd.ExcelFile("多sheetExcel.xlsx")
sheets = xl.sheet_names
for sheet in sheets:
    print(sheet)
    df = xl.parse(sheet)
    df.to_parquet(f"excel_etl/{sheet}.parquet")
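One caveat: to_parquet will not create the target directory for you, so make sure excel_etl exists before writing. A minimal sketch (the directory name excel_etl simply follows the examples above):

import os

# Create the output directory up front; pandas/pyarrow raises an error
# when asked to write into a directory that does not exist
os.makedirs("excel_etl", exist_ok=True)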
If all the sheets share the same layout, PySpark can then read all of the data in one go:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .getOrCreate()
# Simply read the entire directory
df = spark.read.parquet("excel_etl")
# A glob pattern also works if you only want some of the Parquet files
# df = spark.read.parquet("excel_etl/*.parquet")
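If the sheet layouts drift slightly (say, one sheet gained an extra column), Spark's standard Parquet schema merging can reconcile them. A minimal sketch; whether you need mergeSchema depends on your data:

# Merge compatible-but-different schemas across the Parquet files
df = spark.read.option("mergeSchema", "true").parquet("excel_etl")
df.printSchema()  # the merged schema covers columns from every file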
Excel files can also be read directly with Spark Excel plugins (both based on Apache POI). Two of them are introduced here:
spark-excel
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf() \
    .set("spark.jars.packages", "com.crealytics:spark-excel_2.11:0.11.1") \
    .set("spark.sql.shuffle.partitions", "4") \
    .set("spark.sql.execution.arrow.enabled", "true") \
    .set("spark.driver.maxResultSize", "6G") \
    .set("spark.driver.memory", "6G") \
    .set("spark.executor.memory", "6G")

spark = SparkSession.builder \
    .config(conf=conf) \
    .master("local[*]") \
    .getOrCreate()

xlsx = "Online Retail.xlsx"
df = spark.read \
    .format("com.crealytics.spark.excel") \
    .option("useHeader", "true") \
    .option("treatEmptyValuesAsNulls", "false") \
    .option("inferSchema", "true") \
    .option("timestampFormat", "MM-dd-yyyy HH:mm:ss") \
    .option("maxRowsInMemory", 20) \
    .option("excerptSize", 10) \
    .load(xlsx)
df.printSchema()
root
|-- InvoiceNo: double (nullable = true)
|-- StockCode: string (nullable = true)
|-- Description: string (nullable = true)
|-- Quantity: double (nullable = true)
|-- InvoiceDate: timestamp (nullable = true)
|-- UnitPrice: double (nullable = true)
|-- CustomerID: double (nullable = true)
|-- Country: string (nullable = true)
# Some of the inferred column types are wrong and need adjusting
import pyspark.sql.functions as f

df = df.withColumn("InvoiceNo", f.col("InvoiceNo").cast("string"))
df.write.parquet("Online Retail", mode="overwrite")
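Judging by the schema above, CustomerID was also inferred as double; if several columns need fixing, a loop keeps the casts in one place. A minimal sketch; the column-to-type mapping here is only an assumption for illustration:

import pyspark.sql.functions as f  # same import as above

# Hypothetical mapping of columns to their intended types
fixes = {"InvoiceNo": "string", "CustomerID": "string", "Quantity": "int"}
for col_name, dtype in fixes.items():
    df = df.withColumn(col_name, f.col(col_name).cast(dtype))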
HadoopOffice (supports writing as well as reading)
# Failed on an Excel file with too many rows; works fine with fewer rows
# Also very resource-hungry; not yet tested on a cluster
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf() \
    .set("spark.jars.packages", "com.github.zuinnote:spark-hadoopoffice-ds_2.11:1.3.0") \
    .set("spark.sql.shuffle.partitions", "4") \
    .set("spark.sql.execution.arrow.enabled", "true") \
    .set("spark.driver.maxResultSize", "6G") \
    .set("spark.driver.memory", "6G") \
    .set("spark.executor.memory", "6G")

spark = SparkSession.builder \
    .config(conf=conf) \
    .master("local[*]") \
    .getOrCreate()

df = spark.read.format("org.zuinnote.spark.office.excel") \
    .option("read.locale.bcp47", "zh-Hans") \
    .option("read.spark.simpleMode", True) \
    .option("read.header.read", True) \
    .load("Online Retail.xlsx")
df.printSchema()
df.write.parquet("Online Retail", mode="overwrite")
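Since HadoopOffice can also write Excel, here is a minimal sketch of the reverse direction. The write.locale.bcp47 option mirrors the read option in the project's documentation; treat the exact option names as an assumption and check the README linked below:

# Write a DataFrame back out as an .xlsx file
# (assumed options; see the HadoopOffice README for the authoritative list)
df.write.format("org.zuinnote.spark.office.excel") \
    .option("write.locale.bcp47", "zh-Hans") \
    .save("Online Retail out.xlsx")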
References:
https://github.com/ZuInnoTe/spark-hadoopoffice-ds
https://github.com/crealytics/spark-excel