Reading xlsx files with Spark
Versions:
IntelliJ IDEA Community Edition 2019.2.4
apache-maven-3.6.2
Spark 2.0.2
hadoop2.6_Win_x64-master
Without further ado, straight to the point:
I first tried reading the file with SparkContext and found it didn't work, so I switched to SparkSession.
1. First add the jar dependencies (make sure the versions are consistent, otherwise errors will be thrown):
pom.xml
<!-- read Excel xlsx files -->
<dependency>
    <groupId>com.crealytics</groupId>
    <artifactId>spark-excel_2.11</artifactId>
    <version>0.12.2</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.xmlbeans/xmlbeans -->
<dependency>
    <groupId>org.apache.xmlbeans</groupId>
    <artifactId>xmlbeans</artifactId>
    <version>3.1.0</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>3.17</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml-schemas</artifactId>
    <version>3.17</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi</artifactId>
    <version>3.17</version>
</dependency>
2. The code:
package com.h3.pro

import org.apache.spark.sql.SparkSession

object Task1 {
  def main(args: Array[String]): Unit = {
    // Needed because I haven't configured the Hadoop environment variable; I'm running on Win10.
    System.setProperty("hadoop.home.dir", "D:\\software\\hadoop2.6_Win_x64-master")

    val spark = SparkSession.builder()
      .appName("Task1")
      .master("local")
      .getOrCreate()

    val frame = spark.read.format("com.crealytics.spark.excel")
      .option("useHeader", "true")
      // The following three options are optional
      //.option("timestampFormat", "MM-dd-yyyy HH:mm:ss")
      //.option("inferSchema", "false")
      //.option("workbookPassword", "None")
      .load("***.xlsx")

    frame.take(10).foreach(println)
  }
}
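For completeness, the same data source can also write a DataFrame back to an xlsx file. The sketch below assumes the `frame` DataFrame from the code above; the output path `output.xlsx` and the `dataAddress` value are illustrative, and option names may differ across spark-excel versions (this follows the 0.12.x style used here):

```scala
// Write the DataFrame read above back out as a new workbook.
frame.write
  .format("com.crealytics.spark.excel")
  .option("useHeader", "true")          // write a header row (0.12.x option name)
  .option("dataAddress", "'Sheet1'!A1") // target sheet and top-left cell (illustrative)
  .mode("overwrite")
  .save("output.xlsx")                  // illustrative output path
```

Note that spark-excel loads the whole workbook into memory on the driver when reading and writing, so very large spreadsheets may need a larger driver heap.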