In day-to-day work, clients very often deliver data as Excel files, and once they land on your desk you have to deal with them.
My usual approach is to read the file with Pandas and save it as Parquet:
# Read a single sheet
import pandas as pd

df = pd.read_excel("excel1.xlsx")
df.to_parquet("excel_etl/excel1.parquet")
# If one workbook contains multiple sheets
import pandas as pd

xl = pd.ExcelFile("多sheetExcel.xlsx")
sheets = xl.sheet_names
for sheet in sheets:
    print(sheet)
    df = xl.parse(sheet)
    df.to_parquet(f"excel_etl/{sheet}.parquet")
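One caveat: to_parquet will not create the target directory for you, so make sure excel_etl exists before writing. A minimal sketch (the directory name excel_etl simply follows the examples above):

import os

# Create the output directory up front; pandas/pyarrow raises an error
# when asked to write into a directory that does not exist
os.makedirs("excel_etl", exist_ok=True)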
If all the sheets share the same layout, PySpark can then read all of the data in one go:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .getOrCreate()
# Simply read the entire directory
df = spark.read.parquet("excel_etl")
# A glob pattern also works if you only want some of the Parquet files
# df = spark.read.parquet("excel_etl/*.parquet")
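If the sheet layouts drift slightly (say, one sheet gained an extra column), Spark's standard Parquet schema merging can reconcile them. A minimal sketch; whether you need mergeSchema depends on your data:

# Merge compatible-but-different schemas across the Parquet files
df = spark.read.option("mergeSchema", "true").parquet("excel_etl")
df.printSchema()  # the merged schema covers columns from every file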
Excel files can also be read directly with Spark Excel plugins (both based on Apache POI). Two of them are introduced here:
spark-excel
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf() \
    .set("spark.jars.packages", "com.crealytics:spark-excel_2.11:0.11.1") \
    .set("spark.sql.shuffle.partitions", "4") \
    .set("spark.sql.execution.arrow.enabled", "true") \
    .set("spark.driver.maxResultSize", "6G") \
    .set("spark.driver.memory", "6G") \
    .set("spark.executor.memory", "6G")

spark = SparkSession.builder \
    .config(conf=conf) \
    .master("local[*]") \
    .getOrCreate()

xlsx = "Online Retail.xlsx"
df = spark.read \
    .format("com.crealytics.spark.excel") \
    .option("useHeader", "true") \
    .option("treatEmptyValuesAsNulls", "false") \
    .option("inferSchema", "true") \
    .option("timestampFormat", "MM-dd-yyyy HH:mm:ss") \
    .option("maxRowsInMemory", 20) \
    .option("excerptSize", 10) \
    .load(xlsx)
df.printSchema()
root
|-- InvoiceNo: double (nullable = true)
|-- StockCode: string (nullable = true)
|-- Description: string (nullable = true)
|-- Quantity: double (nullable = true)
|-- InvoiceDate: timestamp (nullable = true)
|-- UnitPrice: double (nullable = true)
|-- CustomerID: double (nullable = true)
|-- Country: string (nullable = true)
# Some of the inferred column types are wrong and need adjusting
import pyspark.sql.functions as f

df = df.withColumn("InvoiceNo", f.col("InvoiceNo").cast("string"))
df.write.parquet("Online Retail", mode="overwrite")
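Judging by the schema above, CustomerID was also inferred as double; if several columns need fixing, a loop keeps the casts in one place. A minimal sketch; the column-to-type mapping here is only an assumption for illustration:

import pyspark.sql.functions as f  # same import as above

# Hypothetical mapping of columns to their intended types
fixes = {"InvoiceNo": "string", "CustomerID": "string", "Quantity": "int"}
for col_name, dtype in fixes.items():
    df = df.withColumn(col_name, f.col(col_name).cast(dtype))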
HadoopOffice (supports writing as well as reading)
# Failed on an Excel file with too many rows; works fine with fewer rows
# Also very resource-hungry; not yet tested on a cluster
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf() \
    .set("spark.jars.packages", "com.github.zuinnote:spark-hadoopoffice-ds_2.11:1.3.0") \
    .set("spark.sql.shuffle.partitions", "4") \
    .set("spark.sql.execution.arrow.enabled", "true") \
    .set("spark.driver.maxResultSize", "6G") \
    .set("spark.driver.memory", "6G") \
    .set("spark.executor.memory", "6G")

spark = SparkSession.builder \
    .config(conf=conf) \
    .master("local[*]") \
    .getOrCreate()

df = spark.read.format("org.zuinnote.spark.office.excel") \
    .option("read.locale.bcp47", "zh-Hans") \
    .option("read.spark.simpleMode", True) \
    .option("read.header.read", True) \
    .load("Online Retail.xlsx")
df.printSchema()
df.write.parquet("Online Retail", mode="overwrite")
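Since HadoopOffice can also write Excel, here is a minimal sketch of the reverse direction. The write.locale.bcp47 option mirrors the read option in the project's documentation; treat the exact option names as an assumption and check the README linked below:

# Write a DataFrame back out as an .xlsx file
# (assumed options; see the HadoopOffice README for the authoritative list)
df.write.format("org.zuinnote.spark.office.excel") \
    .option("write.locale.bcp47", "zh-Hans") \
    .save("Online Retail out.xlsx")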
References:
https://github.com/ZuInnoTe/spark-hadoopoffice-ds
https://github.com/crealytics/spark-excel