0. Preface
Today I'm starting a new column dedicated to summarizing and sharing Spark operators. Follow this column, the Spark Operator Collection, for the posts to come.
This first article covers reading Excel data with Spark. I've recently been building an algorithm labeling system whose labeled data is stored in Excel, so I use Spark to read the Excel files and then persist the data.
1. pom.xml Dependencies
The spark-excel dependency is hosted on Maven Central at https://mvnrepository.com/artifact/com.crealytics/spark-excel. In addition, you still need the Spark core and Spark SQL dependencies. The dependencies I use are listed below:
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.1.2</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>3.1.2</version>
</dependency>
<!-- https://mvnrepository.com/artifact/mysql/mysql-connector-java -->
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.47</version>
</dependency>
<!-- https://mvnrepository.com/artifact/com.crealytics/spark-excel -->
<dependency>
    <groupId>com.crealytics</groupId>
    <artifactId>spark-excel_2.12</artifactId>
    <version>0.13.7</version>
</dependency>
2. Creating the Runtime Environment
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf().setAppName("itoe_labelSystem").setMaster("local[*]")
val spark = SparkSession.builder().config(conf).getOrCreate()
3. Defining the Schema
// Define the table schema
import org.apache.spark.sql.types._

val schema = StructType(List(
  StructField("id", StringType),
  StructField("equipID", StringType),
  StructField("faultID", IntegerType),
  StructField("reason", StringType),
  StructField("TimeStamp", DateType)
))
The data types are defined in the org.apache.spark.sql.types package; browse that package to pick the types that fit your columns.
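For reference, here is a hypothetical sketch of a schema using several other common types from that package; the field names are invented purely for illustration:

import org.apache.spark.sql.types._

// Illustrative only: other common types available in org.apache.spark.sql.types
val exampleSchema = StructType(List(
  StructField("name", StringType),
  StructField("count", LongType),
  StructField("score", DoubleType),
  StructField("active", BooleanType),
  StructField("createdAt", TimestampType)
))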
4. Reading Excel into a DataFrame
val df = spark.read
  .format("com.crealytics.spark.excel")
  .option("dataAddress", "'sheet1'!A5:E8") // sheet1 is the sheet name; A5 is the top-left cell and E8 the bottom-right cell of the range
  .option("header", "false")               // whether the first row is a header row
  .schema(schema)                          // apply the schema defined above
  .load(path)                              // path is the Excel file path
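Since the goal stated in the preface is to persist the labeled data, and the pom already pulls in mysql-connector-java, here is a minimal sketch of writing the resulting DataFrame to MySQL over JDBC. The connection URL, database, table name, and credentials below are placeholder assumptions; replace them with your own.

import java.util.Properties

// Placeholder connection settings; adjust to your environment
val props = new Properties()
props.setProperty("user", "root")
props.setProperty("password", "password")
props.setProperty("driver", "com.mysql.jdbc.Driver") // driver class for mysql-connector-java 5.x

df.write
  .mode("append") // append rows to the target table
  .jdbc("jdbc:mysql://localhost:3306/label_system", "fault_labels", props)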