A few words first:
I'm a data analysis intern using this blog to record what I learn, and I hope it also helps others studying the same material.
Life brings all kinds of difficulties and setbacks; avoiding them solves nothing. Only an optimistic spirit can meet life's challenges head-on.
Youth fades fast and learning comes hard; do not take a single moment of time lightly.
My favorite saying: never put off until tomorrow what can be done today.
Preface
This section prepares the utility classes needed for tag development and explains the tables involved.
User profile tag types (by technique)
- Matching: match against the user's basic information to fill out the tag system, e.g. gender, birthplace, education.
- Statistical: compute statistics over basic information or behavioral data, e.g. age bracket, average order value, highest unit price, return rate.
- Mining: mine behavioral data to derive tags, e.g. product preference, spending power, brand preference.
By content, tags cover:
- Demographic attributes
- Commercial attributes
- Behavioral attributes
- User value
- Combined tags
Business data table description
Tag development flow:
Development approach (a code outline follows this list):
- Fetch the level-4 and level-5 tag data (id and rule) from MySQL
- Extract the HBase source information from the level-4 rule
- Extract the matching rules from the level-5 rules
- Load the HBase source
- Compute the tags according to the requirement
- Persist the results
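Put end to end, a single tag job follows the outline below. This is a minimal sketch under assumptions: the MySQL URL and credentials, the table tbl_basic_tag and its id/pid/rule columns are hypothetical names for illustration, and the HBase steps rely on the custom data source developed later in this section.

import org.apache.spark.sql.{DataFrame, SparkSession}

object GenderTagJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("GenderTagJob")
      .master("local[*]")
      .getOrCreate()

    // 1. Read the tag definitions (id, rule) from MySQL.
    //    Table and column names here are assumptions for this sketch.
    val tags: DataFrame = spark.read.format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/tags?useSSL=false")
      .option("dbtable", "tbl_basic_tag")
      .option("user", "root")
      .option("password", "123456")
      .load()

    // 2. The level-4 rule carries the HBase source information;
    //    the level-5 rules carry the value-to-tag mapping.
    val fourRule: DataFrame = tags.where("id = 4").select("rule")    // assumed tag id
    val fiveRules: DataFrame = tags.where("pid = 4").select("id", "rule")

    // 3.-6. Load the HBase source through the custom data source,
    // compute the tags, and write the result back; each of these
    // steps is covered in the rest of this section.

    spark.stop()
  }
}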
Step 1: add the tags
Add the level-4 tag
Tag rule:
inType=HBase##zkHosts=192.168.10.20##zkPort=2181##hbaseTable=tbl_users##family=detail##selectFields=id,gender
inType=HBase | source type: HBase
zkHosts=192.168.10.20 | ZooKeeper hosts: 192.168.10.20
zkPort=2181 | ZooKeeper port: 2181
hbaseTable=tbl_users | HBase table: tbl_users
family=detail | column family: detail
selectFields=id,gender | columns to read: id,gender
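This ##-separated rule string is straightforward to parse into key-value pairs. A minimal sketch in plain Scala, using the exact rule shown above:

// Split "key=value##key=value##..." into a Map[String, String].
val rule = "inType=HBase##zkHosts=192.168.10.20##zkPort=2181##" +
  "hbaseTable=tbl_users##family=detail##selectFields=id,gender"

val kv: Map[String, String] = rule
  .split("##")                            // one "key=value" per element
  .map { pair =>
    val Array(k, v) = pair.split("=", 2)  // split on the first '=' only
    k -> v
  }
  .toMap

// kv("zkHosts") == "192.168.10.20"; kv("selectFields") == "id,gender"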
Add the level-5 tags
In the HBase data, gender is stored as 1 (male) and 2 (female).
The rules of the level-5 tags must therefore correspond exactly to those values.
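Once the user data and the level-5 rules are loaded, the matching itself is a small lookup. A hedged sketch: userDF (columns id, gender), the tag ids 5/6, and the output column names are assumptions for illustration; the real ids come from the level-5 rows in MySQL.

import org.apache.spark.sql.functions.udf

// Assumed mapping: rule value "1"/"2" => the matching level-5 tag's id.
val fiveRuleMap: Map[String, Long] = Map("1" -> 5L, "2" -> 6L)

// Look each user's gender value up in the rule map; -1 marks "no match".
val genderToTag = udf((gender: String) => fiveRuleMap.getOrElse(gender, -1L))

val result = userDF.select(
  userDF("id").as("userId"),
  genderToTag(userDF("gender")).as("tagIds")
)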
Preparation
Import the POM dependencies
<properties>
<scala.version>2.11.8</scala.version>
<spark.version>2.2.0</spark.version>
<hbase.version>1.2.0-cdh5.14.0</hbase.version>
<solr.version>4.10.3-cdh5.14.0</solr.version>
<mysql.version>8.0.17</mysql.version>
<slf4j.version>1.7.21</slf4j.version>
<maven-compiler-plugin.version>3.1</maven-compiler-plugin.version>
<build-helper-plugin.version>3.0.0</build-helper-plugin.version>
<scala-compiler-plugin.version>3.2.0</scala-compiler-plugin.version>
<maven-shade-plugin.version>3.2.1</maven-shade-plugin.version>
</properties>
<dependencies>
<!-- Spark -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.scalanlp</groupId>
<artifactId>breeze_2.11</artifactId>
<version>0.13</version>
</dependency>
<!-- HBase -->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>${hbase.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-common</artifactId>
<version>${hbase.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-server</artifactId>
<version>${hbase.version}</version>
</dependency>
<!-- Solr -->
<dependency>
<groupId>org.apache.solr</groupId>
<artifactId>solr-core</artifactId>
<version>${solr.version}</version>
</dependency>
<dependency>
<groupId>org.apache.solr</groupId>
<artifactId>solr-solrj</artifactId>
<version>${solr.version}</version>
</dependency>
<!-- MySQL -->
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>${mysql.version}</version>
</dependency>
<!-- Logging -->
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>${slf4j.version}</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-simple</artifactId>
<version>${slf4j.version}</version>
</dependency>
<dependency>
<groupId>cn.itcast.up29</groupId>
<artifactId>common</artifactId>
<version>1.0-SNAPSHOT</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.codehaus.mojo</groupId>
<artifactId>build-helper-maven-plugin</artifactId>
<version>${build-helper-plugin.version}</version>
<executions>
<execution>
<phase>generate-sources</phase>
<goals>
<goal>add-source</goal>
</goals>
<configuration>
<sources>
<source>src/main/java</source>
<source>src/main/scala</source>
</sources>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>${maven-compiler-plugin.version}</version>
<configuration>
<encoding>UTF-8</encoding>
<source>1.8</source>
<target>1.8</target>
<verbose>true</verbose>
<fork>true</fork>
</configuration>
</plugin>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>${scala-compiler-plugin.version}</version>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
<configuration>
<args>
<arg>-dependencyfile</arg>
<arg>${project.build.directory}/.scala_dependencies</arg>
</args>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<archive>
<manifest>
<!-- Replace this with the class that contains the jar's main method -->
<mainClass>cn.userprofile.TestTag</mainClass>
</manifest>
<manifestEntries>
<Class-Path>.</Class-Path>
</manifestEntries>
</archive>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<id>make-assembly</id> <!-- this is used for inheritance merges -->
<phase>package</phase> <!-- run the jar merge during the package phase -->
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
<repositories>
<repository>
<id>cloudera</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
</repositories>
HBase metadata case class
The code below will make clear why a case class is used here!
// Holds the HBase connection info parsed out of the level-4 tag rule.
case class HBaseMeta(
  inType: String,
  zkHosts: String,
  zkPort: String,
  hbaseTable: String,
  family: String,
  selectFields: String,
  rowKey: String
)

// Constant key names, matching the keys used in the rule string and options.
object HBaseMeta {
  val INTYPE = "inType"
  val ZKHOSTS = "zkHosts"
  val ZKPORT = "zkPort"
  val HBASETABLE = "hbaseTable"
  val FAMILY = "family"
  val SELECTFIELDS = "selectFields"
  val ROWKEY = "rowKey"
}
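In the tag jobs, the level-4 rule string is first split into a Map (as sketched earlier) and then wrapped into this case class, so downstream code works with typed fields instead of raw map lookups. A hypothetical helper illustrating that step (not part of the original class):

// Hypothetical helper: build an HBaseMeta from the parsed rule map.
def toHBaseMeta(kv: Map[String, String]): HBaseMeta = HBaseMeta(
  kv.getOrElse(HBaseMeta.INTYPE, ""),
  kv.getOrElse(HBaseMeta.ZKHOSTS, ""),
  kv.getOrElse(HBaseMeta.ZKPORT, ""),
  kv.getOrElse(HBaseMeta.HBASETABLE, ""),
  kv.getOrElse(HBaseMeta.FAMILY, ""),
  kv.getOrElse(HBaseMeta.SELECTFIELDS, ""),
  kv.getOrElse(HBaseMeta.ROWKEY, "")
)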
Tag rule case class

// One level-5 tag: its id plus the rule value it matches (e.g. "1" or "2" for gender).
case class TagRule(
  id: Int,
  rule: String
)
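Loading the level-5 rows from MySQL into TagRule instances is then a one-liner. A sketch, assuming fiveRules is the DataFrame of level-5 tags (columns id and rule) from the job outline above:

// Collect the small rule table to the driver and map each Row to a TagRule.
// id is read via Any/toString because JDBC may surface it as Int or Long.
val tagRules: Seq[TagRule] = fiveRules
  .collect()
  .map(row => TagRule(row.getAs[Any]("id").toString.toInt, row.getAs[String]("rule")))
  .toSeq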
HBase data source: HBaseDataSource
package cn.userprofile.tools

import cn.userprofile.bean.HBaseMeta
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider, DataSourceRegister, RelationProvider}

/**
 * Custom HBase data source.
 */
class HBaseDataSource extends RelationProvider with CreatableRelationProvider with DataSourceRegister with Serializable {

  /**
   * Builds the relation for reading the data source.
   *
   * @param sqlContext
   * @param parameters the option key-value pairs passed in when this DataSource is invoked
   * @return
   */
  override def createRelation(sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation = {
    // Pull the HBase-related entries out of parameters
    // and wrap the map into an HBaseMeta object.
    val meta: HBaseMeta = parseMeta(parameters)
    new HBaseReadableRelation(sqlContext, meta)
  }

  /**
   * Builds the relation for writing, i.e. persisting the data.
   *
   * @param sqlContext
   * @param mode       overwrite / append
   * @param parameters the options supplied when the data source was constructed
   * @param data       the data to be saved
   * @return
   */
  override def createRelation(sqlContext: SQLContext,
                              mode: SaveMode,
                              parameters: Map[String, String],
                              data: DataFrame): BaseRelation = {
    // First parse the parameters into an HBaseMeta,
    val meta: HBaseMeta = parseMeta(parameters)
    // then create the HBaseWritableRelation and trigger the insert.
    val relation = new HBaseWritableRelation(sqlContext, meta, data)
    relation.insert(data, true)
    // Return the relation object.
    relation
  }

  /**
   * Converts the option Map into an HBaseMeta.
   *
   * @param params
   * @return
   */
  def parseMeta(params: Map[String, String]): HBaseMeta = {
    HBaseMeta(
      params.getOrElse(HBaseMeta.INTYPE, ""),
      params.getOrElse(HBaseMeta.ZKHOSTS, ""),
      params.getOrElse(HBaseMeta.ZKPORT, ""),
      params.getOrElse(HBaseMeta.HBASETABLE, ""),
      params.getOrElse(HBaseMeta.FAMILY, ""),
      params.getOrElse(HBaseMeta.SELECTFIELDS, ""),
      params.getOrElse(HBaseMeta.ROWKEY, "")
    )
  }

  override def shortName(): String = "hbase"
}
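With the provider in place, reading HBase becomes an ordinary DataFrameReader call. A usage sketch: because the shortName registration file (META-INF/services) is not shown in this post, the fully qualified class name is used, and spark/meta are assumed to be the session and the HBaseMeta parsed from the level-4 rule:

val userDF = spark.read
  .format("cn.userprofile.tools.HBaseDataSource")
  .option(HBaseMeta.INTYPE, meta.inType)
  .option(HBaseMeta.ZKHOSTS, meta.zkHosts)
  .option(HBaseMeta.ZKPORT, meta.zkPort)
  .option(HBaseMeta.HBASETABLE, meta.hbaseTable)
  .option(HBaseMeta.FAMILY, meta.family)
  .option(HBaseMeta.SELECTFIELDS, meta.selectFields)
  .load()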
HBase data source: HBaseReadableRelation
package cn.userprofile.tools

import cn.userprofile.bean.HBaseMeta
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

/**
 * The relation that actually reads from the HBase data source.
 */
class HBaseReadableRelation(context: SQLContext, meta: HBaseMeta) extends BaseRelation with TableScan with Serializable {

  override def sqlContext: SQLContext = context

  // The schema of the data: each column's name, type, and nullability.
  override def schema: StructType = {
    // Build a StructType holding the metadata of every selected column.
    // meta.selectFields comes from the level-4 tag in MySQL, e.g. "id,gender".
    val fields: Array[StructField] = meta.selectFields.split(",")
      .map(fieldName => {
        StructField(fieldName, StringType, true)
      })
    // Wrap the fields into a StructType.
    StructType(fields)
  }

  /**
   * Builds the source data: reads the records from HBase and wraps each one as a Row.
   *
   * @return
   */
  override def buildScan(): RDD[Row] = {
    // The data lives in HBase; what we must return is an RDD[Row].
    // Set up the HBase-related configuration.
    val conf = new Configuration()
    conf.set("hbase.zookeeper.property.clientPort", meta.zkPort)
    conf.set("hbase.zookeeper.quorum", meta.zkHosts)
    conf.set("zookeeper.znode.parent", "/hbase-unsecure")
    conf.set(TableInputFormat.INPUT_TABLE, meta.hbaseTable)
    // Build the source RDD through the Hadoop input format.
    val sourceRDD: RDD[(ImmutableBytesWritable, Result)] = context.sparkContext.newAPIHadoopRDD(
      conf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result]
    )
    // Keep only the Result part of each record.
    val resultRDD: RDD[Result] = sourceRDD.map(_._2)
    // From each Result extract just the fields we need, i.e. map Result => Row.
    resultRDD.map(result => {
      // Split selectFields to get the column names,
      // then turn each column name into that column's value.
      val seq: Seq[String] = meta.selectFields.split(",")
        .map(fieldName => {
          // Values fetched from a Result are byte arrays by default;
          // use HBase's Bytes utility to convert them to String.
          Bytes.toString(result.getValue(meta.family.getBytes, fieldName.getBytes))
        }).toSeq
      // Wrap the column values into a Row.
      Row.fromSeq(seq)
    })
  }
}
HBase data source: HBaseWritableRelation
package cn.userprofile.tools

import cn.userprofile.bean.HBaseMeta
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.client.{Delete, Get, Put, Result}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, InsertableRelation}
import org.apache.spark.sql.types.StructType

/**
 * The relation that writes the data out.
 */
class HBaseWritableRelation(context: SQLContext, meta: HBaseMeta, data: DataFrame) extends BaseRelation with InsertableRelation with Serializable {

  override def sqlContext: SQLContext = context

  override def schema: StructType = data.schema

  /**
   * Inserts the data into HBase.
   *
   * @param data
   * @param overwrite
   */
  override def insert(data: DataFrame, overwrite: Boolean): Unit = {
    // Set up the HBase-related configuration.
    val config = new Configuration()
    config.set("hbase.zookeeper.property.clientPort", meta.zkPort)
    config.set("hbase.zookeeper.quorum", meta.zkHosts)
    config.set("zookeeper.znode.parent", "/hbase-unsecure")
    config.set("mapreduce.output.fileoutputformat.outputdir", "/test01")
    config.set(TableOutputFormat.OUTPUT_TABLE, meta.hbaseTable)
    val job = Job.getInstance(config)
    job.setOutputKeyClass(classOf[ImmutableBytesWritable])
    // The values written through TableOutputFormat are mutations, i.e. Put objects.
    job.setOutputValueClass(classOf[Put])
    job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
    data
      // Convert the DataFrame to an RDD
      .rdd
      // and reduce the parallelism to 1.
      .coalesce(1)
      // Turn every Row into the Put that inserts it into HBase.
      .map(row => {
        // Writing to HBase requires Put objects
        // (similarly: new Delete(rowkey), new Get(rowkey)).
        // A Put needs a rowkey; here the user ID serves as the rowkey.
        val rowkey: String = row.getAs("userId").toString
        val put = new Put(rowkey.getBytes)
        // Add each selected field to the Put as a column.
        meta.selectFields.split(",")
          .foreach(fieldName => {
            // Read the value of the current column from the Row
            val value: String = row.getAs(fieldName).toString
            // and add it to the Put.
            put.addColumn(meta.family.getBytes, fieldName.getBytes, value.getBytes)
          })
        (new ImmutableBytesWritable, put)
      })
      // Save the data through the Hadoop output format.
      .saveAsNewAPIHadoopDataset(job.getConfiguration)
  }
}
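Writing the computed tags back goes through the same data source. A usage sketch, assuming result is the DataFrame produced earlier (it must contain the userId column used as the rowkey in insert above) and a hypothetical output table "test":

import org.apache.spark.sql.SaveMode

result.write
  .format("cn.userprofile.tools.HBaseDataSource")
  .mode(SaveMode.Overwrite)
  .option(HBaseMeta.ZKHOSTS, meta.zkHosts)
  .option(HBaseMeta.ZKPORT, meta.zkPort)
  .option(HBaseMeta.HBASETABLE, "test")                 // assumed output table
  .option(HBaseMeta.FAMILY, "detail")
  .option(HBaseMeta.SELECTFIELDS, "userId,tagIds")      // columns to persist
  .save()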
Summary
This section walked through what the tags are and the full flow for adding and storing them.
Keep the development flow firmly in mind:
- the utility classes for reading from and writing to HBase
- the HBase data source (HBaseDataSource)
- the read relation (HBaseReadableRelation)
- the write relation (HBaseWritableRelation)
Everything above is the preparation work for tag development.
If anything here is incorrect, please let me know and I will fix it promptly.
If this post helped you, a like would be much appreciated. Thanks!