A Spark job, written in Scala, that fills missing values with the boundary-mean (neighbouring-rows average) method
The general idea:
1. Read the MySQL data and turn it into a DataFrame.
2. Add two mean columns via SQL, name_avg and password_avg, holding the boundary means of name and password.
3. Replace nulls in name with name_avg, and nulls in password with password_avg.
Points involved:
1. Window functions:
avg(password) over(rows between 4 preceding and 4 following) — the 4 rows before and after the current row
2. Tuning:
case when, to avoid unnecessary computation
3. Using the coalesce function
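Before looking at the Spark job, the three points above can be sketched in plain Scala, independent of Spark, to make the boundary-mean idea concrete. The names here (BoundaryMeanFill, fill, k) are illustrative, not part of the job:

```scala
// Plain-Scala sketch of the boundary-mean fill: for each missing value,
// average the non-missing values within k rows before and after it.
object BoundaryMeanFill {
  def fill(xs: Vector[Option[Double]], k: Int = 4): Vector[Option[Double]] =
    xs.zipWithIndex.map {
      // Non-null values are kept as-is (the role coalesce plays in the job)
      case (Some(v), _) => Some(v)
      // Null values get the mean of the neighbouring window
      // (the role of case when + avg(...) over(rows between ...))
      case (None, i) =>
        val lo = math.max(0, i - k)
        val hi = math.min(xs.length - 1, i + k)
        val neighbours = (lo to hi).map(xs).collect { case Some(v) => v }
        if (neighbours.isEmpty) None
        else Some(neighbours.sum / neighbours.size)
    }
}
```

For example, `fill(Vector(Some(1.0), None, Some(3.0)))` fills the middle slot with 2.0, the mean of its two neighbours; a window with no non-null neighbours stays null, just as avg over an all-null frame returns null in SQL.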
package sparksql
import org.apache.spark.SparkConf
import org.apache.spark.sql.functions.{coalesce, col}
import org.apache.spark.sql.{DataFrame, SparkSession}

/**
 * create: weihh
 * time: 2022-09-22 08:23:43
 * Job that fills nulls with the boundary (neighbouring-rows) mean
 */
object Spark_test {
  def main(args: Array[String]): Unit = {
    // Create the SparkSQL runtime environment
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("sparkSQL")
    val spark = SparkSession.builder().config(sparkConf).getOrCreate()

    // Read the MySQL data
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://127.0.0.1:3306/test?serverTimezone=UTC")
      .option("driver", "com.mysql.jdbc.Driver")
      .option("user", "root")
      .option("password", "123456")
      // .option("numPartitions", 20)
      // .option("partitionColumn", "part_num")
      .option("dbtable", "user")
      .load()

    // Register a temporary view
    df.createTempView("t")

    // Compute the boundary mean of name and password over a window of at most
    // 4 rows on each side, adding the name_avg and password_avg columns.
    // case when computes the mean only for rows where the value is null;
    // without it, when most values are non-null, a lot of unnecessary
    // computation would be triggered.
    // Note: without an ORDER BY in the window, the row order (and therefore
    // which neighbours fall in the frame) is not deterministic; adding e.g.
    // "order by id" before "rows between" pins it down.
    val df1 = df.sqlContext
      .sql("select id,username,password,name," +
        "case when name is null then avg(name) over(rows between 4 preceding and 4 following) else null end as name_avg," +
        "case when password is null then avg(password) over(rows between 4 preceding and 4 following) else null end as password_avg from t")

    // coalesce takes the first non-null argument: the original value wins,
    // the boundary mean fills the gaps
    val df2 = df1.withColumn("name", coalesce(col("name"), col("name_avg")))
    val df3 = df2.withColumn("password", coalesce(col("password"), col("password_avg")))
    df3.show()

    // // Save the result (also requires importing org.apache.spark.sql.SaveMode)
    // df3.write
    //   .format("jdbc")
    //   .option("url", "jdbc:mysql://linux1:3306/spark-sql")
    //   .option("driver", "com.mysql.jdbc.Driver")
    //   .option("user", "root")
    //   .option("password", "123123")
    //   .option("dbtable", "user1")
    //   .mode(SaveMode.Append)
    //   .save()

    // Close the environment
    spark.close()
  }
}
About tuning:
For ways to tune the MySQL read, see https://blog.csdn.net/weixin_43370502/article/details/122947922
On the Spark side, tune the number of executors, executor cores, driver memory and executor memory as appropriate.
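The MySQL-read tuning hinted at by the commented-out numPartitions/partitionColumn options in the job is a partitioned JDBC read. A sketch of what that could look like here — the partition column and the bounds are assumptions about the table, not taken from the source:

```scala
// Hypothetical partitioned JDBC read. partitionColumn must be a numeric,
// date or timestamp column; lowerBound/upperBound are assumed values that
// should cover the real range of that column.
val partitionedDf = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://127.0.0.1:3306/test?serverTimezone=UTC")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("user", "root")
  .option("password", "123456")
  .option("dbtable", "user")
  .option("partitionColumn", "id")
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", 20)
  .load()
```

On the cluster side, the executor and memory knobs mentioned above correspond to spark-submit flags such as --num-executors, --executor-cores, --executor-memory and --driver-memory.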
Result
Input: (screenshot)
Output: (screenshot)
Only the password and name columns need to be checked.