A few words first:
I'm a data analysis intern using this blog to record what I learn, and I hope it also helps others studying the same material.
Life brings all kinds of difficulties and setbacks; avoiding them solves nothing. Only an optimistic spirit can meet life's challenges head-on.
Youth fades fast and learning comes hard; do not take a single moment of time lightly.
My favorite saying: never put off until tomorrow what can be done today.
Preface
This section prepares the utility classes needed for tag development and explains the tables involved.
User profile tag types (by technique)
- Matching: match against the user's basic information to fill out the tag system, e.g. gender, birthplace, education.
- Statistical: compute statistics over basic information or behavioral data, e.g. age bracket, average order value, highest unit price, return rate.
- Mining: mine behavioral data to derive tags, e.g. product preference, spending power, brand preference.
By content, tags cover:
- Demographic attributes
- Commercial attributes
- Behavioral attributes
- User value
- Combined tags
Business data table description
Tag development flow:
Development approach (a code outline follows this list):
- Fetch the level-4 and level-5 tag data (id and rule) from MySQL
- Extract the HBase source information from the level-4 rule
- Extract the matching rules from the level-5 rules
- Load the HBase source
- Compute the tags according to the requirement
- Persist the results
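Put end to end, a single tag job follows the outline below. This is a minimal sketch under assumptions: the MySQL URL and credentials, the table tbl_basic_tag and its id/pid/rule columns are hypothetical names for illustration, and the HBase steps rely on the custom data source developed later in this section.

import org.apache.spark.sql.{DataFrame, SparkSession}

object GenderTagJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("GenderTagJob")
      .master("local[*]")
      .getOrCreate()

    // 1. Read the tag definitions (id, rule) from MySQL.
    //    Table and column names here are assumptions for this sketch.
    val tags: DataFrame = spark.read.format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/tags?useSSL=false")
      .option("dbtable", "tbl_basic_tag")
      .option("user", "root")
      .option("password", "123456")
      .load()

    // 2. The level-4 rule carries the HBase source information;
    //    the level-5 rules carry the value-to-tag mapping.
    val fourRule: DataFrame = tags.where("id = 4").select("rule")    // assumed tag id
    val fiveRules: DataFrame = tags.where("pid = 4").select("id", "rule")

    // 3.-6. Load the HBase source through the custom data source,
    // compute the tags, and write the result back; each of these
    // steps is covered in the rest of this section.

    spark.stop()
  }
}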
Step 1: add the tags
Add the level-4 tag
Tag rule:
inType=HBase##zkHosts=192.168.10.20##zkPort=2181##hbaseTable=tbl_users##family=detail##selectFields=id,gender
inType=HBase | source type: HBase
zkHosts=192.168.10.20 | ZooKeeper hosts: 192.168.10.20
zkPort=2181 | ZooKeeper port: 2181
hbaseTable=tbl_users | HBase table: tbl_users
family=detail | column family: detail
selectFields=id,gender | columns to read: id,gender
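This ##-separated rule string is straightforward to parse into key-value pairs. A minimal sketch in plain Scala, using the exact rule shown above:

// Split "key=value##key=value##..." into a Map[String, String].
val rule = "inType=HBase##zkHosts=192.168.10.20##zkPort=2181##" +
  "hbaseTable=tbl_users##family=detail##selectFields=id,gender"

val kv: Map[String, String] = rule
  .split("##")                            // one "key=value" per element
  .map { pair =>
    val Array(k, v) = pair.split("=", 2)  // split on the first '=' only
    k -> v
  }
  .toMap

// kv("zkHosts") == "192.168.10.20"; kv("selectFields") == "id,gender"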
Add the level-5 tags
In the HBase data, gender is stored as 1 (male) and 2 (female).
The rules of the level-5 tags must therefore correspond exactly to those values.
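Once the user data and the level-5 rules are loaded, the matching itself is a small lookup. A hedged sketch: userDF (columns id, gender), the tag ids 5/6, and the output column names are assumptions for illustration; the real ids come from the level-5 rows in MySQL.

import org.apache.spark.sql.functions.udf

// Assumed mapping: rule value "1"/"2" => the matching level-5 tag's id.
val fiveRuleMap: Map[String, Long] = Map("1" -> 5L, "2" -> 6L)

// Look each user's gender value up in the rule map; -1 marks "no match".
val genderToTag = udf((gender: String) => fiveRuleMap.getOrElse(gender, -1L))

val result = userDF.select(
  userDF("id").as("userId"),
  genderToTag(userDF("gender")).as("tagIds")
)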
Preparation
Import the POM dependencies
<properties>
<scala.version>2.11.8</scala.version>
<spark.version>2.2.0</spark.version>
<hbase.version>1.2.0-cdh5.14.0</hbase.version>
<solr.version>4.10.3-cdh5.14.0</solr.version>
<mysql.version>8.0.17</mysql.version>
<slf4j.version>1.7.21</slf4j.version>
<maven-compiler-plugin.version>3.1</maven-compiler-plugin.version>
<build-helper-plugin.version>3.0.0</build-helper-plugin.version>
<scala-compiler-plugin.version>3.2.0</scala-compiler-plugin.version>
<maven-shade-plugin.version>3.2.1</maven-shade-plugin.version>
</properties>
<dependencies>
<!-- Spark -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.scalanlp</groupId>
<artifactId>breeze_2.11</artifactId>
<version>0.13</version>
</dependency>
<!-- HBase -->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>${hbase.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-common</artifactId>
<version>${hbase.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-server</artifactId>
<version>${hbase.version}</version>
</dependency>
<!-- Solr -->
<dependency>
<groupId>org.apache.solr</groupId>
<artifactId>solr-core</artifactId>
<version>${solr.version}</version>
</dependency>
<dependency>
<groupId>org.apache.solr</groupId>
<artifactId>solr-solrj</artifactId>
<version>${solr.version}</version>
</dependency>
<!-- MySQL -->
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>${mysql.version}</version>
</dependency>
<!-- Logging -->
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>${slf4j.version}</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-simple</artifactId>
<version>${slf4j.version}</version>
</dependency>
<dependency>
<groupId>cn.itcast.up29</groupId>
<artifactId>common</artifactId>
<version>1.0-SNAPSHOT</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.codehaus.mojo</groupId>
<artifactId>build-helper-maven-plugin</artifactId>
<version>${build-helper-plugin.version}</version>
<executions>
<execution>
<phase>generate-sources</phase>
<goals>
<goal>add-source</goal>
</goals>
<configuration>
<sources>
<source>src/main/java</source>
<source>src/main/scala</source>
</sources>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>${maven-compiler-plugin.version}</version>
<configuration>
<encoding>UTF-8</encoding>
<source>1.8</source>
<target>1.8</target>
<verbose>true</verbose>
<fork>true</fork>
</configuration>
</plugin>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>${scala-compiler-plugin.version}</version>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
<configuration>
<args>
<arg>-dependencyfile</arg>
<arg>${project.build.directory}/.scala_dependencies</arg>
</args>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<archive>
<manifest>
<!-- Replace this with the class that contains the jar's main method -->
<mainClass>cn.userprofile.TestTag</mainClass>
</manifest>
<manifestEntries>
<Class-Path>.</Class-Path>
</manifestEntries>
</archive>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<id>make-assembly</id> <!-- this is used for inheritance merges -->
<phase>package</phase> <!-- run the jar merge during the package phase -->
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
<repositories>
<repository>
<id>cloudera</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
</repositories>
HBase metadata case class
The code below will make clear why a case class is used here!
// Holds the HBase connection info parsed out of the level-4 tag rule.
case class HBaseMeta(
  inType: String,
  zkHosts: String,
  zkPort: String,
  hbaseTable: String,
  family: String,
  selectFields: String,
  rowKey: String
)

// Constant key names, matching the keys used in the rule string and options.
object HBaseMeta {
  val INTYPE = "inType"
  val ZKHOSTS = "zkHosts"
  val ZKPORT = "zkPort"
  val HBASETABLE = "hbaseTable"
  val FAMILY = "family"
  val SELECTFIELDS = "selectFields"
  val ROWKEY = "rowKey"
}
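In the tag jobs, the level-4 rule string is first split into a Map (as sketched earlier) and then wrapped into this case class, so downstream code works with typed fields instead of raw map lookups. A hypothetical helper illustrating that step (not part of the original class):

// Hypothetical helper: build an HBaseMeta from the parsed rule map.
def toHBaseMeta(kv: Map[String, String]): HBaseMeta = HBaseMeta(
  kv.getOrElse(HBaseMeta.INTYPE, ""),
  kv.getOrElse(HBaseMeta.ZKHOSTS, ""),
  kv.getOrElse(HBaseMeta.ZKPORT, ""),
  kv.getOrElse(HBaseMeta.HBASETABLE, ""),
  kv.getOrElse(HBaseMeta.FAMILY, ""),
  kv.getOrElse(HBaseMeta.SELECTFIELDS, ""),
  kv.getOrElse(HBaseMeta.ROWKEY, "")
)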
Tag rule case class

// One level-5 tag: its id plus the rule value it matches (e.g. "1" or "2" for gender).
case class TagRule(
  id: Int,
  rule: String
)
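Loading the level-5 rows from MySQL into TagRule instances is then a one-liner. A sketch, assuming fiveRules is the DataFrame of level-5 tags (columns id and rule) from the job outline above:

// Collect the small rule table to the driver and map each Row to a TagRule.
// id is read via Any/toString because JDBC may surface it as Int or Long.
val tagRules: Seq[TagRule] = fiveRules
  .collect()
  .map(row => TagRule(row.getAs[Any]("id").toString.toInt, row.getAs[String]("rule")))
  .toSeq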
HBase data source: HBaseDataSource
package cn.userprofile.tools

import cn.userprofile.bean.HBaseMeta
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider, DataSourceRegister, RelationProvider}

/**
 * Custom HBase data source.
 */
class HBaseDataSource extends RelationProvider with CreatableRelationProvider with DataSourceRegister with Serializable {

  /**
   * Builds the relation for reading the data source.
   *
   * @param sqlContext
   * @param parameters the option key-value pairs passed in when this DataSource is invoked
   * @return
   */
  override def createRelation(sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation = {
    // Pull the HBase-related entries out of parameters
    // and wrap the map into an HBaseMeta object.
    val meta: HBaseMeta = parseMeta(parameters)
    new HBaseReadableRelation(sqlContext, meta)
  }

  /**
   * Builds the relation for writing, i.e. persisting the data.
   *
   * @param sqlContext
   * @param mode       overwrite / append
   * @param parameters the options supplied when the data source was constructed
   * @param data       the data to be saved
   * @return
   */
  override def createRelation(sqlContext: SQLContext,
                              mode: SaveMode,
                              parameters: Map[String, String],
                              data: DataFrame): BaseRelation = {
    // First parse the parameters into an HBaseMeta,
    val meta: HBaseMeta = parseMeta(parameters)
    // then create the HBaseWritableRelation and trigger the insert.
    val relation = new HBaseWritableRelation(sqlContext, meta, data)
    relation.insert(data, true)
    // Return the relation object.
    relation
  }

  /**
   * Converts the option Map into an HBaseMeta.
   *
   * @param params
   * @return
   */
  def parseMeta(params: Map[String, String]): HBaseMeta = {
    HBaseMeta(
      params.getOrElse(HBaseMeta.INTYPE, ""),
      params.getOrElse(HBaseMeta.ZKHOSTS, ""),
      params.getOrElse(HBaseMeta.ZKPORT, ""),
      params.getOrElse(HBaseMeta.HBASETABLE, ""),
      params.getOrElse(HBaseMeta.FAMILY, ""),
      params.getOrElse(HBaseMeta.SELECTFIELDS, ""),
      params.getOrElse(HBaseMeta.ROWKEY, "")
    )
  }

  override def shortName(): String = "hbase"
}
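With the provider in place, reading HBase becomes an ordinary DataFrameReader call. A usage sketch: because the shortName registration file (META-INF/services) is not shown in this post, the fully qualified class name is used, and spark/meta are assumed to be the session and the HBaseMeta parsed from the level-4 rule:

val userDF = spark.read
  .format("cn.userprofile.tools.HBaseDataSource")
  .option(HBaseMeta.INTYPE, meta.inType)
  .option(HBaseMeta.ZKHOSTS, meta.zkHosts)
  .option(HBaseMeta.ZKPORT, meta.zkPort)
  .option(HBaseMeta.HBASETABLE, meta.hbaseTable)
  .option(HBaseMeta.FAMILY, meta.family)
  .option(HBaseMeta.SELECTFIELDS, meta.selectFields)
  .load()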
HBase data source: HBaseReadableRelation
package cn.userprofile.tools

import cn.userprofile.bean.HBaseMeta
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

/**
 * The relation that actually reads from the HBase data source.
 */
class HBaseReadableRelation(context: SQLContext, meta: HBaseMeta) extends BaseRelation with TableScan with Serializable {

  override def sqlContext: SQLContext = context

  // The schema of the data: each column's name, type, and nullability.
  override def schema: StructType = {
    // Build a StructType holding the metadata of every selected column.
    // meta.selectFields comes from the level-4 tag in MySQL, e.g. "id,gender".
    val fields: Array[StructField] = meta.selectFields.split(",")
      .map(fieldName => {
        StructField(fieldName, StringType, true)
      })
    // Wrap the fields into a StructType.
    StructType(fields)
  }

  /**
   * Builds the source data: reads the records from HBase and wraps each one as a Row.
   *
   * @return
   */
  override def buildScan(): RDD[Row] = {
    // The data lives in HBase; what we must return is an RDD[Row].
    // Set up the HBase-related configuration.
    val conf = new Configuration()
    conf.set("hbase.zookeeper.property.clientPort", meta.zkPort)
    conf.set("hbase.zookeeper.quorum", meta.zkHosts)
    conf.set("zookeeper.znode.parent", "/hbase-unsecure")
    conf.set(TableInputFormat.INPUT_TABLE, meta.hbaseTable)
    // Build the source RDD through the Hadoop input format.
    val sourceRDD: RDD[(ImmutableBytesWritable, Result)] = context.sparkContext.newAPIHadoopRDD(
      conf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result]
    )
    // Keep only the Result part of each record.
    val resultRDD: RDD[Result] = sourceRDD.map(_._2)
    // From each Result extract just the fields we need, i.e. map Result => Row.
    resultRDD.map(result => {
      // Split selectFields to get the column names,
      // then turn each column name into that column's value.
      val seq: Seq[String] = meta.selectFields.split(",")
        .map(fieldName => {
          // Values fetched from a Result are byte arrays by default;
          // use HBase's Bytes utility to convert them to String.
          Bytes.toString(result.getValue(meta.family.getBytes, fieldName.getBytes))
        }).toSeq
      // Wrap the column values into a Row.
      Row.fromSeq(seq)
    })
  }
}
HBase data source: HBaseWritableRelation
package cn.userprofile.tools

import cn.userprofile.bean.HBaseMeta
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.client.{Delete, Get, Put, Result}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, InsertableRelation}
import org.apache.spark.sql.types.StructType

/**
 * The relation that writes the data out.
 */
class HBaseWritableRelation(context: SQLContext, meta: HBaseMeta, data: DataFrame) extends BaseRelation with InsertableRelation with Serializable {

  override def sqlContext: SQLContext = context

  override def schema: StructType = data.schema

  /**
   * Inserts the data into HBase.
   *
   * @param data
   * @param overwrite
   */
  override def insert(data: DataFrame, overwrite: Boolean): Unit = {
    // Set up the HBase-related configuration.
    val config = new Configuration()
    config.set("hbase.zookeeper.property.clientPort", meta.zkPort)
    config.set("hbase.zookeeper.quorum", meta.zkHosts)
    config.set("zookeeper.znode.parent", "/hbase-unsecure")
    config.set("mapreduce.output.fileoutputformat.outputdir", "/test01")
    config.set(TableOutputFormat.OUTPUT_TABLE, meta.hbaseTable)
    val job = Job.getInstance(config)
    job.setOutputKeyClass(classOf[ImmutableBytesWritable])
    // The values written through TableOutputFormat are mutations, i.e. Put objects.
    job.setOutputValueClass(classOf[Put])
    job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
    data
      // Convert the DataFrame to an RDD
      .rdd
      // and reduce the parallelism to 1.
      .coalesce(1)
      // Turn every Row into the Put that inserts it into HBase.
      .map(row => {
        // Writing to HBase requires Put objects
        // (similarly: new Delete(rowkey), new Get(rowkey)).
        // A Put needs a rowkey; here the user ID serves as the rowkey.
        val rowkey: String = row.getAs("userId").toString
        val put = new Put(rowkey.getBytes)
        // Add each selected field to the Put as a column.
        meta.selectFields.split(",")
          .foreach(fieldName => {
            // Read the value of the current column from the Row
            val value: String = row.getAs(fieldName).toString
            // and add it to the Put.
            put.addColumn(meta.family.getBytes, fieldName.getBytes, value.getBytes)
          })
        (new ImmutableBytesWritable, put)
      })
      // Save the data through the Hadoop output format.
      .saveAsNewAPIHadoopDataset(job.getConfiguration)
  }
}
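Writing the computed tags back goes through the same data source. A usage sketch, assuming result is the DataFrame produced earlier (it must contain the userId column used as the rowkey in insert above) and a hypothetical output table "test":

import org.apache.spark.sql.SaveMode

result.write
  .format("cn.userprofile.tools.HBaseDataSource")
  .mode(SaveMode.Overwrite)
  .option(HBaseMeta.ZKHOSTS, meta.zkHosts)
  .option(HBaseMeta.ZKPORT, meta.zkPort)
  .option(HBaseMeta.HBASETABLE, "test")                 // assumed output table
  .option(HBaseMeta.FAMILY, "detail")
  .option(HBaseMeta.SELECTFIELDS, "userId,tagIds")      // columns to persist
  .save()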
Summary
This section walked through what the tags are and the full flow for adding and storing them.
Keep the development flow firmly in mind:
- the utility classes for reading from and writing to HBase
- the HBase data source (HBaseDataSource)
- the read relation (HBaseReadableRelation)
- the write relation (HBaseWritableRelation)
Everything above is the preparation work for tag development.
If anything here is incorrect, please let me know and I will fix it promptly.
If this post helped you, a like would be much appreciated. Thanks!