Prerequisite: the Hive Metastore service must be running (e.g. started with `hive --service metastore &`) so that Spark can interact with Hive.
[hadoop@hadoop-01 data]$ spark-shell --driver-class-path /home/hadoop/app/hive-1.1.0-cdh5.7.0/lib/mysql-connector-java.jar
scala> import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql.hive.orc._
scala> import org.apache.spark.sql._
import org.apache.spark.sql._
scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession
scala> import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
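In spark-shell a SparkSession is already available as `spark` (with Hive support when Spark is built with it), so the imports above are all that is needed here. In a packaged application you would build the session yourself; a minimal sketch, assuming Spark 2.x and a hypothetical application name:

```scala
import org.apache.spark.sql.SparkSession

// Only needed outside spark-shell: build a session that can talk to the Hive Metastore.
val spark = SparkSession.builder()
  .appName("spark-etl-demo")   // hypothetical application name
  .enableHiveSupport()         // requires a reachable Hive Metastore
  .getOrCreate()

import spark.implicits._       // enables RDD-to-DataFrame conversions
```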
scala> val lines=sc.textFile("/data/inputSource.txt")
lines: org.apache.spark.rdd.RDD[String] = /data/inputSource.txt MapPartitionsRDD[1] at textFile at <console>:32
## Performing the ETL
scala> val data = lines.map(x => {
| val str = x.split("\t")
| (str(0), str(1), str(3), str(4), str(6), str(10), str(12), str(20))
| })
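The summary below notes that the processed data is then exported as ORC and Parquet. A minimal sketch of that step, using the StructType/StructField imports from above; the column names and output paths are hypothetical, since the real field names of inputSource.txt are not shown:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical column names -- replace with the actual fields of inputSource.txt.
val schema = StructType(
  Seq("col0", "col1", "col3", "col4", "col6", "col10", "col12", "col20")
    .map(name => StructField(name, StringType, nullable = true)))

// Turn each tuple into a Row and apply the schema.
val rowRDD = data.map(t => Row(t.productIterator.toSeq: _*))
val df = spark.createDataFrame(rowRDD, schema)

// Export the ETL result (example output paths).
df.write.mode("overwrite").orc("/data/output_orc")
df.write.mode("overwrite").parquet("/data/output_parquet")
```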

This article describes how to use Spark Core to perform extract, transform, load (ETL) operations and export the processed data in ORC and Parquet file formats. First, make sure the Hive Metastore service is running so that Spark can interact with Hive.