A couple of words up front:
I'm a data-analysis intern using this blog to record what I'm learning, and hopefully to help others studying the same material.
Life throws all kinds of difficulties and setbacks at us; running away solves nothing, and the only way through is to meet life's challenges with optimism.
Youth passes quickly and learning is hard to achieve; not an inch of time should be taken lightly.
My favourite saying: finish today's work today.
Occupation Tag Development
We've finally reached tag development. After all the trials along the way, the finish line here is also a new starting point, so keep going, everyone.
Development Preparation
↓↓↓ Click the link below to get the POM file and the preparation work needed beforehand (required reading!) ↓↓↓
企业级360°全方位用户画像:标签开发[匹配标签](前期准备工作)①
Case classes: HBaseMeta and TagRule are defined ahead of time so the data is easier to work with later on.
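If you don't have them handy from the prep article, here is a minimal sketch of the two case classes, with field names taken from how they are used later in this post; the companion-object constant values and the TagRule fields are my assumptions, so check them against the prep article:
//Minimal sketch of HBaseMeta; field names follow the way it is used in the code below
case class HBaseMeta(inType: String, zkHosts: String, zkPort: String, hbaseTable: String, family: String, selectFields: String, rowKey: String)
object HBaseMeta {
//option-key constants used when reading/writing the HBase data source (values assumed here)
val ZKHOSTS = "zkHosts"
val ZKPORT = "zkPort"
val HBASETABLE = "hbaseTable"
val FAMILY = "family"
val SELECTFIELDS = "selectFields"
val ROWKEY = "rowKey"
}
//TagRule holds one level-5 tag row (id + rule); fields assumed, it is not used directly in the code below
case class TagRule(id: Long, rule: String)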
Add the tags on the web page
- Start the service
- Open a browser and go to
localhost:8081
- "Occupation" (职业) belongs under the demographic attributes (人口属性) category
- Create the level-4 tag
- Create the occupation tag with the rule below
inType=HBase##zkHosts=192.168.10.20##zkPort=2181##hbaseTable=tbl_users##family=detail##selectFields=id,job
- Create the level-5 tags
Tag Matching
The code is walked through step by step below.
① Create the SparkSession object
//Create the SparkSession object, used to read data from HBase and MySQL
val spark = SparkSession.builder().appName("Job").master("local[*]").getOrCreate()
val sparkContext = spark.sparkContext
sparkContext.setLogLevel("WARN")
② Connect to MySQL
//2. Connect to MySQL, used to read the level-4 and level-5 tags
val url = "jdbc:mysql://bd001:3306/tags_new?characterEncoding=UTF-8"
val tableName = "tbl_basic_tag" //the table to read
val properties: Properties = new Properties()
properties.put("user", "root") //MySQL username
properties.put("password", "123456") //MySQL password
val connectMysql = spark.read.jdbc(url, tableName, properties)
connectMysql.show()
/**
* +---+----+--------+--------------------+--------+-----+---+-------------------+-------------------+-----+------+
* | id|name|industry| rule|business|level|pid| ctime| utime|state|remark|
* +---+----+--------+--------------------+--------+-----+---+-------------------+-------------------+-----+------+
* | 1| 电商| null| null| null| 1| -1|2019-10-25 23:31:36|2019-10-25 23:31:36| null| null|
* | 2| 某电商| null| null| null| 2| 1|2019-10-25 23:31:36|2019-10-25 23:31:36| null| null|
* | 3|人口属性| null| null| null| 3| 2|2019-10-25 23:31:36|2019-10-25 23:31:36| null| null|
* | 7| 职业| null|inType=HBase##zkH...| null| 4| 3|2019-10-27 02:56:58|2019-10-27 02:57:02| null| null|
* | 8| 学生| null| 1| null| 5| 7| null| null| null| null|
* | 9| 公务员| null| 2| null| 5| 7| null| null| null| null|
* | 10| 军人| null| 3| null| 5| 7| null| null| null| null|
* | 11| 警察| null| 4| null| 5| 7| null| null| null| null|
* | 12| 教师| null| 5| null| 5| 7| null| null| null| null|
* | 13| 白领| null| 6| null| 5| 7| null| null| null| null|
* +---+----+--------+--------------------+--------+-----+---+-------------------+-------------------+-----+------+
*/
③ Read the level-4 tag data
val four_Tag = connectMysql.select("id", "rule").where("name='职业'")
// four_Tag.show(false)
/**
* +---+----------------------------------------------------------------------------------------------------------+
* |id |rule |
* +---+----------------------------------------------------------------------------------------------------------+
* |7 |inType=HBase##zkHosts=192.168.10.20##zkPort=2181##hbaseTable=tbl_users##family=detail##selectFields=id,job|
* +---+----------------------------------------------------------------------------------------------------------+
*/
//Import implicit conversions
import spark.implicits._
//Import Spark SQL built-in functions
import org.apache.spark.sql.functions._
//Import Java <-> Scala collection converters
import scala.collection.JavaConverters._
//Read the ID of the occupation (level-4) tag
val four_id: Int = four_Tag.map(row => {
row.getAs("id").toString
}).collectAsList().get(0).toInt
// println("四级标签Id:",four_id)
/**
* (Level-4 tag id:,7)
*/
//Read the rule of the occupation (level-4) tag
val four_rule = four_Tag.map(row => {
row.getAs("rule").toString
.split("##").map(kvrow => {
val kvMap = kvrow.split("=")
(kvMap(0), kvMap(1))
})
}).collectAsList().get(0).toMap
// println("四级标签Rule:",four_rule)
/**
* (Level-4 tag rule:,Map(selectFields -> id,job, inType -> HBase, zkHosts -> 192.168.10.20, zkPort -> 2181, hbaseTable -> tbl_users, family -> detail))
*/
//Wrap the level-4 rule in a case class so the fields are easier to access
val hBaseMeta = MapToHbaseRule(four_rule)
// println("将Map转换成样例类:",hBaseMeta)
/**
* (Map converted to case class:,HBaseMeta(HBase,192.168.10.20,2181,tbl_users,detail,id,job,))
*/
Helper method MapToHbaseRule: wraps the rule Map into the HBaseMeta case class
def MapToHbaseRule(four_rule: Map[String, String]) = {
HBaseMeta(inType = four_rule.getOrElse("inType",""),
zkHosts = four_rule.getOrElse("zkHosts",""),
zkPort = four_rule.getOrElse("zkPort",""),
hbaseTable = four_rule.getOrElse("hbaseTable",""),
family = four_rule.getOrElse("family",""),
selectFields = four_rule.getOrElse("selectFields",""),
rowKey = four_rule.getOrElse("rowKey",""))
}
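As a quick sanity check, feeding MapToHbaseRule the rule Map parsed in step ③ reproduces the HBaseMeta printed above; a small sketch, where demoRule is simply a hand-written copy of that Map:
//Hypothetical check: build the same rule Map by hand and confirm the resulting HBaseMeta
val demoRule = Map(
"inType" -> "HBase",
"zkHosts" -> "192.168.10.20",
"zkPort" -> "2181",
"hbaseTable" -> "tbl_users",
"family" -> "detail",
"selectFields" -> "id,job")
val demoMeta = MapToHbaseRule(demoRule)
//prints HBaseMeta(HBase,192.168.10.20,2181,tbl_users,detail,id,job,) since rowKey falls back to ""
println(demoMeta)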
④ Read the level-5 tag data:
//4. Read the level-5 tags (alias id as fiveid so it can be referenced after the join in step ⑥)
val five_Tag = connectMysql.select('id.as("fiveid"), 'rule).where("pid=" + four_id)
// five_Tag.show(false)
/**
* +------+----+
* |fiveid|rule|
* +------+----+
* |8     |1   |
* |9     |2   |
* |10    |3   |
* |11    |4   |
* |12    |5   |
* |13    |6   |
* +------+----+
*/
⑤ Read the HBase data, using the HBaseMeta prepared in step ③
//5. Read the HBase data
val hbase_users = spark.read.format("cn.itcast.up29.tools.HBaseSource")
.option(HBaseMeta.ZKHOSTS, hBaseMeta.zkHosts)
.option(HBaseMeta.ZKPORT, hBaseMeta.zkPort)
.option(HBaseMeta.FAMILY, hBaseMeta.family)
.option(HBaseMeta.HBASETABLE, hBaseMeta.hbaseTable)
.option(HBaseMeta.SELECTFIELDS, hBaseMeta.selectFields)
.load()
hbase_users.show()
/**
* +---+---+
* | id|job|
* +---+---+
* | 1| 3|
* | 10| 5|
* |100| 3|
* |101| 1|
* |102| 1|
* |103| 3|
* |104| 6|
* |105| 2|
* |106| 4|
* |107| 1|
* |108| 4|
* |109| 6|
* | 11| 6|
* |110| 4|
* |111| 1|
* |112| 1|
* |113| 6|
* |114| 1|
* |115| 4|
* |116| 6|
* +---+---+
* only showing top 20 rows
*/
⑥ Match the level-5 tags against tbl_users (the user table)
//6. Match the level-5 tags against the HBase data
val new_Tag = hbase_users.join(five_Tag, five_Tag.col("rule") === hbase_users.col("job"))
.select('id.as("userId"), 'fiveid.as("tagsId"))
/**
* +------+------+
* |userId|tagsId|
* +------+------+
* | 1| 10|
* | 100| 10|
* | 103| 10|
* | 117| 10|
* | 12| 10|
* | 120| 10|
* | 124| 10|
* | 126| 10|
* | 130| 10|
* | 131| 10|
* | 134| 10|
* | 136| 10|
* | 141| 10|
*/
⑦ Read the historical data from HBase and merge the new data with the old
//7. Read the historical data from HBase and merge the new data with the old
// Think about it: the final tag table in HBase already holds data, so what happens if the new data is written in directly? Answer: it overwrites the old data
// Next: appending solves the overwrite problem, but what happens if the same job is run several times? Answer: duplicates appear
// The fix for duplicates is simply to run a deduplication step after appending
val old_Tag: DataFrame = spark.read.format("cn.itcast.up29.tools.HBaseSource")
.option(HBaseMeta.ZKHOSTS, hBaseMeta.zkHosts)
.option(HBaseMeta.ZKPORT, hBaseMeta.zkPort)
.option(HBaseMeta.FAMILY, hBaseMeta.family)
.option(HBaseMeta.HBASETABLE, "test")
.option(HBaseMeta.SELECTFIELDS, "userId,tagsId")
.load()
// old_Tag.show()
/**
* No data has been written yet,
* +------+------+
* |userId|tagsId|
* +------+------+
* +------+------+
*/
//7.1 Start merging the data
if (old_Tag.count() == 0) {
//No historical data yet, so write the new data directly
new_Tag.write.format("cn.itcast.up29.tools.HBaseSource")
.option(HBaseMeta.ZKHOSTS, hBaseMeta.zkHosts)
.option(HBaseMeta.ZKPORT, hBaseMeta.zkPort)
.option(HBaseMeta.FAMILY, hBaseMeta.family)
.option(HBaseMeta.HBASETABLE, "test")
.option(HBaseMeta.SELECTFIELDS, "userId,tagsId")
.save()
} else {
val append_Tag: UserDefinedFunction = udf((old_T: String, new_T: String) => {
// println(old_T, new_T)
if (old_T == "" && new_T == "") {
""
} else if (old_T == "") {
new_T
} else if (new_T == "") {
old_T
} else {
val all_T = old_T + "," + new_T
//deduplicate
val all_TAG = all_T.split(",").distinct.mkString(",")
all_TAG
}
})
val old_Append_new: DataFrame = old_Tag.join(new_Tag, old_Tag.col("userId") === new_Tag.col("userId"))
.select(
when(old_Tag.col("userId").isNotNull, old_Tag.col("userId"))
.when(new_Tag.col("userId").isNotNull, new_Tag.col("userId")).as("userId"),
append_Tag(old_Tag.col("tagsId"), new_Tag.col("tagsId")).as("tagsId"))
old_Append_new.show()
/**
* +------+----------+
* |userId| tagsId|
* +------+----------+
* | 296| 18,13,5|
* | 467| 17,13,6|
* | 675| 19,10,6|
* | 691| 18,8,5|
* | 829| 17,12,5|
* | 125| 20,12,6|
* | 451| 20,9,6|
* | 800| 18,13,5|
* | 853| 20,11,6|
* | 944| 17,10,6|
* | 666| 17,9,6|
* | 870| 18,13,5|
* | 919| 18,8,6|
* | 926| 17,8,6|
* | 124| 19,10,5|
* | 447| 18,12,5|
* | 51|28,18,11,5|
* | 591| 19,10,5|
* | 7|28,17,12,6|
* | 307| 19,11,6|
* +------+----------+
*/
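The merge-and-dedup rule inside the append_Tag UDF can be checked on its own with plain Scala (no Spark needed); a minimal sketch:
//Plain-Scala sketch of the merge rule used by append_Tag
def mergeTags(oldT: String, newT: String): String =
if (oldT.isEmpty) newT
else if (newT.isEmpty) oldT
else (oldT + "," + newT).split(",").distinct.mkString(",")
println(mergeTags("18,13,5", "13")) //18,13,5  (the duplicate 13 is dropped)
println(mergeTags("", "10"))        //10       (no history yet, keep the new tag)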
⑧ Write the final result
//8. Write the final result to HBase
old_Append_new.write.format("cn.itcast.up29.tools.HBaseSource")
.option(HBaseMeta.ZKHOSTS, hBaseMeta.zkHosts)
.option(HBaseMeta.ZKPORT, hBaseMeta.zkPort)
.option(HBaseMeta.FAMILY, hBaseMeta.family)
.option(HBaseMeta.HBASETABLE, "test")
.option(HBaseMeta.SELECTFIELDS, "userId,tagsId")
.save()
Complete Code
package cn.itcast.up29.matchtag
import java.util.Properties
import cn.itcast.up29.bean.HBaseMeta
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.{DataFrame, SparkSession}
object Job_Tag {
def main(args: Array[String]): Unit = {
//1. Create the SparkSession object, used to read data from HBase and MySQL
val spark = SparkSession.builder().appName("Job").master("local[*]").getOrCreate()
val sparkContext = spark.sparkContext
sparkContext.setLogLevel("WARN")
//2. Connect to MySQL, used to read the level-4 and level-5 tags
val url = "jdbc:mysql://bd001:3306/tags_new?characterEncoding=UTF-8"
val tableName = "tbl_basic_tag" //the table to read
val properties: Properties = new Properties()
properties.put("user", "root") //MySQL username
properties.put("password", "123456") //MySQL password
val connectMysql = spark.read.jdbc(url, tableName, properties)
// connectMysql.show()
/**
* +---+----+--------+--------------------+--------+-----+---+-------------------+-------------------+-----+------+
* | id|name|industry| rule|business|level|pid| ctime| utime|state|remark|
* +---+----+--------+--------------------+--------+-----+---+-------------------+-------------------+-----+------+
* | 1| 电商| null| null| null| 1| -1|2019-10-25 23:31:36|2019-10-25 23:31:36| null| null|
* | 2| 某电商| null| null| null| 2| 1|2019-10-25 23:31:36|2019-10-25 23:31:36| null| null|
* | 3|人口属性| null| null| null| 3| 2|2019-10-25 23:31:36|2019-10-25 23:31:36| null| null|
* | 7| 职业| null|inType=HBase##zkH...| null| 4| 3|2019-10-27 02:56:58|2019-10-27 02:57:02| null| null|
* | 8| 学生| null| 1| null| 5| 7| null| null| null| null|
* | 9| 公务员| null| 2| null| 5| 7| null| null| null| null|
* | 10| 军人| null| 3| null| 5| 7| null| null| null| null|
* | 11| 警察| null| 4| null| 5| 7| null| null| null| null|
* | 12| 教师| null| 5| null| 5| 7| null| null| null| null|
* | 13| 白领| null| 6| null| 5| 7| null| null| null| null|
* +---+----+--------+--------------------+--------+-----+---+-------------------+-------------------+-----+------+
*/
//3. Read the level-4 tag data
val four_Tag = connectMysql.select("id", "rule").where("name='职业'")
// four_Tag.show(false)
/**
* +---+----------------------------------------------------------------------------------------------------------+
* |id |rule |
* +---+----------------------------------------------------------------------------------------------------------+
* |7 |inType=HBase##zkHosts=192.168.10.20##zkPort=2181##hbaseTable=tbl_users##family=detail##selectFields=id,job|
* +---+----------------------------------------------------------------------------------------------------------+
*/
//Import implicit conversions
import spark.implicits._
//Import Spark SQL built-in functions
import org.apache.spark.sql.functions._
//Import Java <-> Scala collection converters
import scala.collection.JavaConverters._
val four_id: Int = four_Tag.map(row => {
row.getAs("id").toString
}).collectAsList().get(0).toInt
// println("四级标签Id:",four_id)
/**
* (Level-4 tag id:,7)
*/
val four_rule = four_Tag.map(row => {
row.getAs("rule").toString
.split("##").map(kvrow => {
val kvMap = kvrow.split("=")
(kvMap(0), kvMap(1))
})
}).collectAsList().get(0).toMap
// println("四级标签Rule:",four_rule)
/**
* (Level-4 tag rule:,Map(selectFields -> id,job, inType -> HBase, zkHosts -> 192.168.10.20, zkPort -> 2181, hbaseTable -> tbl_users, family -> detail))
*/
//Wrap the level-4 rule in a case class so the fields are easier to access
val hBaseMeta = MapToHbaseRule(four_rule)
// println("将Map转换成样例类:",hBaseMeta)
/**
* (Map converted to case class:,HBaseMeta(HBase,192.168.10.20,2181,tbl_users,detail,id,job,))
*/
//------------------------------
//4. Read the level-5 tags
val five_Tag = connectMysql.select('id.as("fiveid"), 'rule).where("pid=" + four_id)
// five_Tag.show(false)
/**
* +------+----+
* |fiveid|rule|
* +------+----+
* |8     |1   |
* |9     |2   |
* |10    |3   |
* |11    |4   |
* |12    |5   |
* |13    |6   |
* +------+----+
*/
//5. Read the HBase data
val hbase_users = spark.read.format("cn.itcast.up29.tools.HBaseSource")
.option(HBaseMeta.ZKHOSTS, hBaseMeta.zkHosts)
.option(HBaseMeta.ZKPORT, hBaseMeta.zkPort)
.option(HBaseMeta.FAMILY, hBaseMeta.family)
.option(HBaseMeta.HBASETABLE, hBaseMeta.hbaseTable)
.option(HBaseMeta.SELECTFIELDS, hBaseMeta.selectFields)
.load()
// hbase_users.show()
/**
* +---+---+
* | id|job|
* +---+---+
* | 1| 3|
* | 10| 5|
* |100| 3|
* |101| 1|
* |102| 1|
* |103| 3|
* |110| 4|
* |111| 1|
* |112| 1|
* |113| 6|
* |114| 1|
* |115| 4|
* |116| 6|
* +---+---+
*/
//6. Match the level-5 tags against the HBase data
val new_Tag = hbase_users.join(five_Tag, five_Tag.col("rule") === hbase_users.col("job"))
.select('id.as("userId"), 'fiveid.as("tagsId"))
/**
* +------+------+
* |userId|tagsId|
* +------+------+
* | 1| 10|
* | 100| 10|
* | 103| 10|
* | 117| 10|
* | 12| 10|
* | 120| 10|
* | 124| 10|
* | 126| 10|
* | 130| 10|
* | 131| 10|
* | 134| 10|
* | 136| 10|
* | 141| 10|
*/
//7. Read the historical data from HBase and merge the new data with the old
// Think about it: the final tag table in HBase already holds data, so what happens if the new data is written in directly? Answer: it overwrites the old data
// Next: appending solves the overwrite problem, but what happens if the same job is run several times? Answer: duplicates appear
// The fix for duplicates is simply to run a deduplication step after appending
val old_Tag: DataFrame = spark.read.format("cn.itcast.up29.tools.HBaseSource")
.option(HBaseMeta.ZKHOSTS, hBaseMeta.zkHosts)
.option(HBaseMeta.ZKPORT, hBaseMeta.zkPort)
.option(HBaseMeta.FAMILY, hBaseMeta.family)
.option(HBaseMeta.HBASETABLE, "test")
.option(HBaseMeta.SELECTFIELDS, "userId,tagsId")
.load()
// old_Tag.show()
/**
* No data has been written yet,
* +------+------+
* |userId|tagsId|
* +------+------+
* +------+------+
*/
//7.1 Start merging the data
if (old_Tag.count() == 0) {
//No historical data yet, so write the new data directly
new_Tag.write.format("cn.itcast.up29.tools.HBaseSource")
.option(HBaseMeta.ZKHOSTS, hBaseMeta.zkHosts)
.option(HBaseMeta.ZKPORT, hBaseMeta.zkPort)
.option(HBaseMeta.FAMILY, hBaseMeta.family)
.option(HBaseMeta.HBASETABLE, "test")
.option(HBaseMeta.SELECTFIELDS, "userId,tagsId")
.save()
} else {
val append_Tag: UserDefinedFunction = udf((old_T: String, new_T: String) => {
// println(old_T, new_T)
if (old_T == "" && new_T == "") {
""
} else if (old_T == "") {
new_T
} else if (new_T == "") {
old_T
} else {
val all_T = old_T + "," + new_T
//deduplicate
val all_TAG = all_T.split(",").distinct.mkString(",")
all_TAG
}
})
val old_Append_new: DataFrame = old_Tag.join(new_Tag, old_Tag.col("userId") === new_Tag.col("userId"))
.select(
when(old_Tag.col("userId").isNotNull, old_Tag.col("userId"))
.when(new_Tag.col("userId").isNotNull, new_Tag.col("userId")).as("userId"),
append_Tag(old_Tag.col("tagsId"), new_Tag.col("tagsId")).as("tagsId"))
// old_Append_new.show()
/**
* +------+----------+
* |userId| tagsId|
* +------+----------+
* | 296| 18,13,5|
* | 467| 17,13,6|
* | 675| 19,10,6|
* | 691| 18,8,5|
* | 829| 17,12,5|
* | 125| 20,12,6|
* | 451| 20,9,6|
* | 800| 18,13,5|
* | 853| 20,11,6|
* | 944| 17,10,6|
* | 666| 17,9,6|
* | 870| 18,13,5|
* | 919| 18,8,6|
* | 926| 17,8,6|
* | 124| 19,10,5|
* | 447| 18,12,5|
* | 51|28,18,11,5|
* | 591| 19,10,5|
* | 7|28,17,12,6|
* | 307| 19,11,6|
* +------+----------+
*/
//8. Write the final result to HBase
old_Append_new.write.format("cn.itcast.up29.tools.HBaseSource")
.option(HBaseMeta.ZKHOSTS, hBaseMeta.zkHosts)
.option(HBaseMeta.ZKPORT, hBaseMeta.zkPort)
.option(HBaseMeta.FAMILY, hBaseMeta.family)
.option(HBaseMeta.HBASETABLE, "test")
.option(HBaseMeta.SELECTFIELDS, "userId,tagsId")
.save()
}
}
def MapToHbaseRule(four_rule: Map[String, String]) = {
HBaseMeta(inType = four_rule.getOrElse("inType",""),
zkHosts = four_rule.getOrElse("zkHosts",""),
zkPort = four_rule.getOrElse("zkPort",""),
hbaseTable = four_rule.getOrElse("hbaseTable",""),
family = four_rule.getOrElse("family",""),
selectFields = four_rule.getOrElse("selectFields",""),
rowKey = four_rule.getOrElse("rowKey",""))
}
}
Summary
That wraps up development of the occupation tag.
The key points, highlighted:
- Create the SparkSession object, used to read from MySQL and HBase
- Connect to MySQL to read the level-4 and level-5 tags
- Read the level-4 tag
- Read the level-5 tags
- Read the HBase data, using the metadata prepared in step ③
- Match the level-5 tags against tbl_users (the user table)
- Read the historical data from HBase and merge the new data with the old
- Write the final data to HBase
Thanks for the support. If this helped you, remember to give it a like.