I. Data Extraction
Task 1: Data Extraction
Using Spark, perform a full extraction of the tables CUSTOMER, NATION, PART, PARTSUPP, REGION, and SUPPLIER from the MySQL shtd_store database into the corresponding tables customer, nation, part, partsupp, region, and supplier in the Hive ods database.
1. Extract the full CUSTOMER data from the shtd_store database into the table customer in the Hive ods database. Keep the field order and types unchanged, and add a static partition: the partition field is of type String and its value is the day before the current competition date, in yyyyMMdd format. Then run show partitions ods.customer in the hive cli, and paste screenshots of the Spark submit command and of the hive cli output into the corresponding report. A sketch of the date calculation follows; the full program comes after it.
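The partition value can be hardcoded, as in the program below, or computed from the current date. A minimal sketch of the calculation using java.time (the variable name etlDate is my own):

import java.time.LocalDate
import java.time.format.DateTimeFormatter

// The day before the current date, formatted as yyyyMMdd,
// e.g. 20240303 when run on 20240304.
val etlDate: String = LocalDate.now()
  .minusDays(1)
  .format(DateTimeFormatter.ofPattern("yyyyMMdd"))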
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

object BC_Extract_Clean {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("ods_dwd")
      // Allow dynamic partition handling when overwriting Hive partitions.
      .config("hive.exec.dynamic.partition", "true")
      .config("hive.exec.dynamic.partition.mode", "nonstrict")
      .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
      // Hive warehouse directory on HDFS and the external metastore service.
      // (The metastore's own MySQL connection settings belong in hive-site.xml,
      // not in the Spark application, so they are not set here.)
      .config("spark.sql.warehouse.dir", "hdfs://xueai:8020/user/hive/warehouse")
      .config("hive.metastore.uris", "thrift://xueai:9083")
      .enableHiveSupport()
      .getOrCreate()
    Test1_Cq(spark)
    // Verify that the static partition exists.
    spark.sql("show partitions ods.customer").show()
  }

  def Test1_Cq(spark: SparkSession): Unit = {
    val MyURL = "jdbc:mysql://xueai:3306/shtd_store"
    val jdbcMap = Map(
      "url" -> MyURL,
      "dbtable" -> "CUSTOMER",
      "user" -> "root",
      "password" -> "admin"
    )
    // Full read of the CUSTOMER table from MySQL.
    val df1 = ETL_tool.MysqlJDBC(spark, jdbcMap)
    // Static partition column: the day before the competition date (yyyyMMdd);
    // hardcoded here, or compute it as in the date sketch above.
    val df2 = df1.withColumn("etl_date", lit("20240303"))
    val hiveOptions = Map("db" -> "ods", "tb" -> "customer", "lie" -> "etl_date")
    ETL_tool.hiveLoad(spark, df2, hiveOptions, "overwrite")
  }
}
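The ETL_tool helper object used above is not shown in this post. A minimal sketch consistent with the call sites, assuming the option keys db, tb, and lie ("lie" is pinyin for 列, "column") carry the database name, table name, and partition column:

import org.apache.spark.sql.{DataFrame, SparkSession}

object ETL_tool {
  // Read a MySQL table through Spark's built-in JDBC data source.
  def MysqlJDBC(spark: SparkSession, jdbcMap: Map[String, String]): DataFrame =
    spark.read.format("jdbc").options(jdbcMap).load()

  // Write the DataFrame to a Hive table, partitioned by the column
  // named in opts("lie").
  def hiveLoad(spark: SparkSession, df: DataFrame,
               opts: Map[String, String], mode: String): Unit =
    df.write
      .mode(mode)
      .format("hive")
      .partitionBy(opts("lie"))
      .saveAsTable(s"${opts("db")}.${opts("tb")}")
}

Note that the MySQL JDBC driver (mysql-connector-java) must be on the Spark classpath for the read to succeed.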
First start HDFS, then run hive --service metastore, and then run the code.
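To capture the Spark submit command screenshot that the task asks for, package the program as a jar and submit it, for example: spark-submit --master local[*] --class BC_Extract_Clean BC_Extract_Clean.jar (the jar name here is a placeholder; add the MySQL connector with --jars if it is not already on the classpath). Then run show partitions ods.customer in the hive cli for the second screenshot.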
The remaining tasks will be published one at a time; stay tuned.