I. Data Extraction
Task 1: Data Extraction
Using Spark, perform a full extraction of the tables CUSTOMER, NATION, PART, PARTSUPP, REGION, and SUPPLIER from the MySQL shtd_store database into the corresponding tables customer, nation, part, partsupp, region, and supplier in the Hive ods database.
1. Extract the full CUSTOMER data from the shtd_store database into the table customer in the Hive ods database. Keep the field order and types unchanged, and add a static partition: the partition field is of type String and its value is the day before the current competition date, in yyyyMMdd format. Then run show partitions ods.customer in the hive cli, and paste screenshots of the Spark submit command and of the hive cli output into the corresponding report. A sketch of the date calculation follows; the full program comes after it.
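The partition value can be hardcoded, as in the program below, or computed from the current date. A minimal sketch of the calculation using java.time (the variable name etlDate is my own):

import java.time.LocalDate
import java.time.format.DateTimeFormatter

// The day before the current date, formatted as yyyyMMdd,
// e.g. 20240303 when run on 20240304.
val etlDate: String = LocalDate.now()
  .minusDays(1)
  .format(DateTimeFormatter.ofPattern("yyyyMMdd"))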
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

object BC_Extract_Clean {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("ods_dwd")
      // Allow dynamic partition handling when overwriting Hive partitions.
      .config("hive.exec.dynamic.partition", "true")
      .config("hive.exec.dynamic.partition.mode", "nonstrict")
      .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
      // Hive warehouse directory on HDFS and the external metastore service.
      // (The metastore's own MySQL connection settings belong in hive-site.xml,
      // not in the Spark application, so they are not set here.)
      .config("spark.sql.warehouse.dir", "hdfs://xueai:8020/user/hive/warehouse")
      .config("hive.metastore.uris", "thrift://xueai:9083")
      .enableHiveSupport()
      .getOrCreate()
    Test1_Cq(spark)
    // Verify that the static partition exists.
    spark.sql("show partitions ods.customer").show()
  }

  def Test1_Cq(spark: SparkSession): Unit = {
    val MyURL = "jdbc:mysql://xueai:3306/shtd_store"
    val jdbcMap = Map(
      "url" -> MyURL,
      "dbtable" -> "CUSTOMER",
      "user" -> "root",
      "password" -> "admin"
    )
    // Full read of the CUSTOMER table from MySQL.
    val df1 = ETL_tool.MysqlJDBC(spark, jdbcMap)
    // Static partition column: the day before the competition date (yyyyMMdd);
    // hardcoded here, or compute it as in the date sketch above.
    val df2 = df1.withColumn("etl_date", lit("20240303"))
    val hiveOptions = Map("db" -> "ods", "tb" -> "customer", "lie" -> "etl_date")
    ETL_tool.hiveLoad(spark, df2, hiveOptions, "overwrite")
  }
}
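The ETL_tool helper object used above is not shown in this post. A minimal sketch consistent with the call sites, assuming the option keys db, tb, and lie ("lie" is pinyin for 列, "column") carry the database name, table name, and partition column:

import org.apache.spark.sql.{DataFrame, SparkSession}

object ETL_tool {
  // Read a MySQL table through Spark's built-in JDBC data source.
  def MysqlJDBC(spark: SparkSession, jdbcMap: Map[String, String]): DataFrame =
    spark.read.format("jdbc").options(jdbcMap).load()

  // Write the DataFrame to a Hive table, partitioned by the column
  // named in opts("lie").
  def hiveLoad(spark: SparkSession, df: DataFrame,
               opts: Map[String, String], mode: String): Unit =
    df.write
      .mode(mode)
      .format("hive")
      .partitionBy(opts("lie"))
      .saveAsTable(s"${opts("db")}.${opts("tb")}")
}

Note that the MySQL JDBC driver (mysql-connector-java) must be on the Spark classpath for the read to succeed.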
First start HDFS, then run hive --service metastore, and then run the code.
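To capture the Spark submit command screenshot that the task asks for, package the program as a jar and submit it, for example: spark-submit --master local[*] --class BC_Extract_Clean BC_Extract_Clean.jar (the jar name here is a placeholder; add the MySQL connector with --jars if it is not already on the classpath). Then run show partitions ods.customer in the hive cli for the second screenshot.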
The remaining tasks will be published one at a time; stay tuned.