离线数据处理任务二：数据清洗

taoyundao_1

于 2023-12-31 08:43:23 发布

阅读量995

点赞数 17

文章标签： hive

本文链接：https://blog.csdn.net/taoyundao_1/article/details/135312996

版权

任务二大数据清洗

陶运道 QQ：275931339

一、数据清洗：数据清洗是指对数据进行重新审查和校验，在这一过程中，发现并纠正数据文件中可识别错误，按照一定规则把错误或冲突数据洗掉，包括检查数据一致性，处理无效值和缺失值等。

二、清洗要求：编写Scala代码，使用Spark将ods库中相应表数据全量抽取到Hive的dwd库中对应表中。表中有涉及到timestamp类型的，均要求按照yyyy-MM-dd HH:mm:ss，不记录毫秒数，若原数据中只有年月日，则在时分秒的位置添加00:00:00，添加之后使其符合yyyy-MM-dd HH:mm:ss。(若dwd库中部分表没有数据，正常抽取即可)

抽取ods库中customer_inf表中昨天的分区数据，并结合dwd.customer_inf最新分区现有的数据，根据customer_id合并数据到dwd库中dwd.customer1_inf的分区表（合并是指对dwd层数据进行插入或修改，需修改的数据以customer_id为合并字段，根据modified_time排序取最新的一条），分区字段为etl_date且值与ods库的相对应表该值相等，添加dwd_insert_user、dwd_insert_time、dwd_modify_user、dwd_modify_time四列,其中dwd_insert_user、dwd_modify_user均填写“user1”。若该条记录第一次进入数仓dwd层则dwd_insert_time、dwd_modify_time均存当前操作时间，并进行数据类型转换。若该数据在进入dwd层时发生了合并修改，则dwd_insert_time时间不变，dwd_modify_time存当前操作时间，其余列存最新的值。使用hive cli执行show partitions dwd.dim_user_info命令。

三、解题思路

dwd.customer_inf 存放了以前存量数据如图所示

customer_id customer_name etl_date

19990 庞帅 20221230

19991 曾凤兰 20221230

19992 陈涛 20221230

19993 朱斌 20221230

19994 范萍 20221230

19995 姜洋 20221230

19996 黄婷 20221230

19997 解娜 20221230

19998 赵丹丹 20221230

ods_customer_inf存放了所有到昨天为止的数据

19988 陈玲 20221230

19989 程琴 20221230

19990 庞帅 20221230

19991 曾凤兰 20221230

19992 陈涛 20221230

19993 朱斌 20221230

19994 范萍 20221230

19995 姜洋 20221230

19996 黄婷 20221230

19997 解娜 20221230

19998 赵丹丹 20221230

19999 AAA 20231230

20000 BBB 20231230

20001 CCC 20231230

从图中可以看出此表多了三条记录

dwd.customer1_inf 用于分析数据分析表（整个数据）

19988 陈玲 20231230

19989 程琴 20231230

19990 庞帅 20231230

19991 曾凤兰 20231230

19992 陈涛 20231230

19993 朱斌 20231230

19994 范萍 20231230

19995 姜洋 20231230

19996 黄婷 20231230

19997 解娜 20231230

19998 赵丹丹 20231230

19999 AAA 20231230

20000 BBB 20231230

20001 CCC 20231230

(1)创建表ods.customer_inf 用于抽取mysql数据

Hive> CREATE TABLE ods.customer_inf (customer_inf_id INT, customer_id INT,customer_name STRING,identity_card_type TINYINT,identity_card_no string,mobile_phone string,customer_email STRING,gender STRING,customer_point INT,register_time TIMESTAMP,birthday STRING,customer_level TINYINT, customer_money DOUBLE,modified_time TIMESTAMP) PARTITIONED BY (etl_date STRING) row format delimited fields terminated by ','

（2）创建表dwd.customer_inf 用于第一次全量抽取及清洗

Hive> create table dwd.customer_inf like ods.customer_inf

（3）dwd.customer1_inf 用于将合并数据存入

Hive> create table dwd.customer1_inf like ods.customer_inf

四、实验过程

1.按要求全量抽取到dwd库中

编写sparkAPI 进行数据处理

import org.apache.spark.sql.expressions.Window

import org.apache.spark.sql.{SaveMode, SparkSession}

import org.apache.spark.sql.functions.{col, desc, format_string, from_unixtime, lead, lit, row_number, to_timestamp, unix_timestamp, when}

   val spark: SparkSession = SparkSession.builder().appName("abc").master("local[*]")

.config("hive.metastore.uris", "thrift://192.168.100.100:9083")

  .config("spark.sql.metastore.warehouse","hdfs://master:9000/user/hive/warehouse")

  .enableHiveSupport()

  .getOrCreate()

    val df = spark.read

  .format("jdbc")

  .option("driver", "com.mysql.cj.jdbc.Driver")

  .option("url", "jdbc:mysql://192.168.100.100:3306/ds_db01?serverTimezone=GMT")

  .option("dbtable", "customer_inf")

  .option("user", "root")

  .option("password", "123456")

  .load().createOrReplaceTempView("t1")

(1)全量mysql抽取数据至dwd.customer_inf

spark.read

.format("jdbc")

.option("driver", "com.mysql.cj.jdbc.Driver")

.option("url", "jdbc:mysql://master:3306/ds_db01?serverTimezone=GMT")

.option("user", "root")

.option("password", "123456")

.option("dbtable", "product_info")

.load()

.createOrReplaceTempView("t1")

spark.sql("select * from t1").show()

//全量抽取
spark.sql(
s"""
|insert overwrite table ods.customer_inf
|partition (etl_date="20231229")
|select * from t1
|""".stripMargin).show()

spark.sql("select * from ods.customer_inf").show

创建dwd.customer.inf表

//清洗代码

spark.read.table("ods.customer_inf")

.withColumn("dwd_insert_user",lit("user1"))

.withColumn("dwd_insert_time",to_timestamp(from_unixtime(unix_timestamp(),"yyyy-MM-dd HH:mm:ss")))

.withColumn("dwd_modify_user",lit("user1"))

.withColumn("dwd_modify_time",to_timestamp(from_unixtime(unix_timestamp(),"yyyy-MM-dd HH:mm:ss")))

.withColumn("birthday",to_timestamp(from_unixtime(unix_timestamp(col("birthday"),"yyyy-MM-dd HH:mm:ss"))))

.write

.mode(SaveMode.Overwrite)

.partitionBy("etl_date")

.saveAsTable("dwd.customer_inf")

查看抽取内容

sparkSession.sql("select * from dwd.customer_inf").show()

（2）在表ods.customer_inf中插入若干记录，然后增量清洗

(1) 在MySQL表中增加三条记录

增加记录

增量抽取

spark.sql(
s"""
|insert overwrite table ods.customer_inf
|partition (etl_date="20231230")
|select * from t1 where customer_id>19998
|""".stripMargin).show()

注此时ods.customer_inf有两个分区。

（3）清洗ods 库中user_info 表数据（整个数据）

val ods_user_info=sparkSession.table("ods.customer_inf")

.withColumn("dwd_insert_user",lit("user1"))

.withColumn("dwd_insert_time",to_timestamp(from_unixtime(unix_timestamp(),"yyyy-MM-dd HH:mm:ss")))

.withColumn("dwd_modify_user",lit("user1"))

.withColumn("dwd_modify_time",to_timestamp(from_unixtime(unix_timestamp(),"yyyy-MM-dd HH:mm:ss")))

.withColumn("birthday",to_timestamp(from_unixtime(unix_timestamp(col("birthday"),"yyyy-MM-dd HH:mm:ss"))))

.withColumn("operate_time",when(col("operate_time").isNull,col("create_time")).otherwise(col("operate_time")))

.withColumn("etl_date",lit("20231211"))

（4）将ods.customer_inf 与dwd.customer_inf表数据合并并去重，数据存入表dwd.customer1_inf

// 取表数据