Splitting user behavior data into sessions, with a complete implementation of Spark reading and writing ClickHouse

Business background:

        The task is to analyze the collected event-tracking data. The raw tracking data carries no session ID (sessionID): all events that a user generates within a given time gap of one another (configurable per business needs, set to 20 minutes here) are recorded under one unique session ID, and an event that arrives after a longer gap is recorded under a new session ID.

Technologies used:

1. Spark SQL

2. Spark reading and writing ClickHouse

3. Scala

Implementation:

1. pom dependencies

    <properties>
        <spark.version>3.2.0</spark.version>
        <scala.binary.version>2.12</scala.binary.version>
        <flink.version>1.13.3</flink.version>
        <java.version>1.8</java.version>
        <slf4j.version>1.7.30</slf4j.version>
        <clickhouse.jdbc>0.2.3</clickhouse.jdbc>
    </properties>
    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${scala.binary.version}</artifactId>
            <version>${spark.version}</version>
            <scope>provided</scope>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_${scala.binary.version}</artifactId>
            <version>${spark.version}</version>
            <scope>provided</scope>
        </dependency>

        <dependency>
            <groupId>com.thoughtworks.paranamer</groupId>
            <artifactId>paranamer</artifactId>
            <version>2.8</version>
        </dependency>


        <!-- https://mvnrepository.com/artifact/com.fasterxml.jackson.core/jackson-core -->
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-core</artifactId>
            <version>2.12.3</version>
        </dependency>

        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
            <version>2.12.3</version>
        </dependency>


        <!-- https://mvnrepository.com/artifact/ru.yandex.clickhouse/clickhouse-jdbc -->
        <dependency>
            <groupId>ru.yandex.clickhouse</groupId>
            <artifactId>clickhouse-jdbc</artifactId>
            <version>${clickhouse.jdbc}</version>
        </dependency>

        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.62</version>
        </dependency>
    </dependencies>

    <build>
        <finalName>UserLogETL</finalName>
        <plugins>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.4.6</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>3.0.0</version>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>

        </plugins>
    </build>

2. Working with ClickHouse

        Because the tracking events are pushed directly to Kafka and then ingested into ClickHouse through a Kafka engine table, the first step is to read the data back out of ClickHouse.

a. Reading from ClickHouse

    //create the SparkConf
    val sparkConf: SparkConf = new SparkConf().setAppName("logSession").setMaster("local[*]")

    //create the SparkSession
    val sparkSession: SparkSession = SparkSession.builder().config(sparkConf).getOrCreate()

    val querySql =
      s"""(select
         |login_id,event,properties,distinct_id,timestamp,dt,time from ods_user_log) tmp""".stripMargin

    //read from ClickHouse
    val dataDf = sparkSession
      .read
      .format("jdbc")
      .option("url", "jdbc:clickhouse://ip:8123")
      .option("fetchsize", "500000")
      .option("driver", "ru.yandex.clickhouse.ClickHouseDriver")
      .option("user", "ck")
      .option("password", "1234")
      .option("dbtable", querySql)
      .load()

For the full list of read options, see the Spark JDBC data source documentation: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
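For larger extracts the JDBC source can also read in parallel. A minimal sketch, assuming the numeric timestamp column (epoch milliseconds) is used as the partition column; the bounds and partition count below are illustrative, not values from the original job:

    //parallel JDBC read (sketch): partitionColumn must be numeric, bounds are illustrative
    val parallelDf = sparkSession
      .read
      .format("jdbc")
      .option("url", "jdbc:clickhouse://ip:8123")
      .option("driver", "ru.yandex.clickhouse.ClickHouseDriver")
      .option("user", "ck")
      .option("password", "1234")
      .option("dbtable", querySql)
      .option("partitionColumn", "timestamp")
      .option("lowerBound", "1640649600000")   //illustrative range start
      .option("upperBound", "1640736000000")   //illustrative range end
      .option("numPartitions", "8")
      .load()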

b. Writing to ClickHouse

For this step the target table must already exist in ClickHouse (a DDL sketch follows the write snippet below); the data is again written out of a DataFrame into ClickHouse.

    userLogDF
      .write
      .format("jdbc")
      .mode(SaveMode.Append)
      .option("driver", "ru.yandex.clickhouse.ClickHouseDriver")
      .option("url", "jdbc:clickhouse://ip:8123")
      .option("batchsize", "50000")
      .option("user", "ck")
      .option("password", "1234")
      .option("isolationLevel", "NONE")
      .option("dbtable","tablename" )
      .save()
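One way to create the target table is to run the DDL once through the same JDBC driver. This is only a sketch: the table name dev_dwd.dwd_user_log_session, the MergeTree engine and the PARTITION BY / ORDER BY keys are assumptions to adapt to your setup; the columns mirror the UserLog case class used in the next section.

    import java.sql.DriverManager

    //sketch only: table name, engine, PARTITION BY and ORDER BY are assumptions
    val ddl =
      """CREATE TABLE IF NOT EXISTS dev_dwd.dwd_user_log_session
        |(
        |    tenant_id       String,
        |    login_id        String,
        |    shop_id         String,
        |    shop_group_1_id String,
        |    shop_group_2_id String,
        |    s_channel_id    String,
        |    session_id      String,
        |    properties      String,
        |    distinct_id     String,
        |    time            DateTime,
        |    isexit          Int32,
        |    view_rk         Int32,
        |    step_dur        Int64,
        |    timestamp       Int64,
        |    event           String,
        |    dt              Date
        |) ENGINE = MergeTree()
        |PARTITION BY dt
        |ORDER BY (distinct_id, timestamp)""".stripMargin

    Class.forName("ru.yandex.clickhouse.ClickHouseDriver")
    val conn = DriverManager.getConnection("jdbc:clickhouse://ip:8123", "ck", "1234")
    try {
      conn.createStatement().execute(ddl)
    } finally {
      conn.close()
    }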

3. Session splitting implementation

While splitting the sessions, the processing also determines the exit page and ranks the page-view order.

import java.sql.{Date, Timestamp}
import com.alibaba.fastjson.{JSON, JSONObject}
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import scala.collection.mutable



object UserLogETL {

  def main(args: Array[String]): Unit = {

    //arguments passed in at submit time, used to filter the query time window
    val endTime: String = args(0)
    val startTime: String = args(1) + " " + "18:00:00"

    //create the SparkConf
    val sparkConf: SparkConf = new SparkConf().setAppName("logSession").setMaster("local[*]")

    //create the SparkSession
    val sparkSession: SparkSession = SparkSession.builder().config(sparkConf).getOrCreate()

    val querySql =
      s"""(select
         |login_id,event,properties,distinct_id,timestamp,dt,time from dev_ods.ods_user_log
         |where time > toDateTime('$startTime') and dt between toDate('$startTime') and toDate('$endTime')) tmp""".stripMargin

    //read from ClickHouse
    val dataDf = sparkSession
      .read
      .format("jdbc")
      .option("url", "jdbc:clickhouse://ip:8123")
      .option("fetchsize", "500000")
      .option("driver", "ru.yandex.clickhouse.ClickHouseDriver")
      .option("user", "ck")
      .option("password", "1234")
      .option("dbtable",querySql )
      .load()


    //convert the rows that were read into an RDD of JSON strings
    val jsonStrRdd: RDD[String] = dataDf.toJSON.rdd

    //parse each string into a JSONObject, drop events without a tenant id, then group by user
    val groupByuserRDD = jsonStrRdd.map(str => {
      JSON.parseObject(str)
    }).filter(jsonstr => {
      jsonstr.getJSONObject("properties").getString("tenantid") != null
    })
      .groupBy(_.getString("distinct_id"))

    val sessionLog: RDD[(String, mutable.Buffer[JSONObject])] = groupByuserRDD.mapValues(iter => {
      //sort the events of this user by timestamp
      val sortList: List[JSONObject] = iter.toList.sortWith(
        (left, right) => {
          left.getLongValue("timestamp") < right.getLongValue("timestamp")
        }
      )

      val bufferArray: mutable.Buffer[JSONObject] = sortList.toBuffer

      //session id splitting
      //initial session id: user id + timestamp of the first event
      var sessionId: String = bufferArray.head.getString("distinct_id") + "_" + bufferArray.head.getString("timestamp")
      //assign session ids; a gap of more than 20 minutes starts a new session
      for (i <- bufferArray.indices) {
        if (i > 0) {
          if (bufferArray(i).getLong("timestamp") - bufferArray(i - 1).getLong("timestamp") > 20 * 60 * 1000) {
            sessionId = bufferArray(i).getString("distinct_id") + "_" + bufferArray(i).getString("timestamp")
          }
        }
        bufferArray(i).put("session_id", sessionId)
        //lift frequently used properties up to top-level fields
        bufferArray(i).put("tenant_id", bufferArray(i).getJSONObject("properties").getString("tenantid"))
        bufferArray(i).put("shop_group_2_id", bufferArray(i).getJSONObject("properties").getString("shop_group_2_id"))
        bufferArray(i).put("shop_id", bufferArray(i).getJSONObject("properties").getString("shop_id"))
        bufferArray(i).put("shop_group_1_id", bufferArray(i).getJSONObject("properties").getString("shop_group_1_id"))
        bufferArray(i).put("s_channel_id", bufferArray(i).getJSONObject("properties").getString("s_channel_id"))
      }
      bufferArray
    })
    import sparkSession.implicits._

    //page-view dwell time and view order
    val jsonObjectRDD: RDD[JSONObject] = sessionLog.map(_._2).flatMap(_.toList)
    val stpDurRDD: RDD[UserLog] = jsonObjectRDD.
      groupBy(_.getString("session_id")).mapValues(iter => {
      //sort the events of this session by timestamp
      val sortList: List[JSONObject] = iter.toList.sortWith(
        (left, right) => {
          left.getLongValue("timestamp") < right.getLongValue("timestamp")
        })
      val bufferArray: Array[JSONObject] = sortList.toArray
      //view order counter
      var viewRk: Int = 0
      //index of the previous page-view event
      var lastViewIndex: Int = 0

      for (i <- bufferArray.indices) {
        if (viewRk == 0 && ("pageShow".equals(bufferArray(i).getString("event")) || "detailPageView".equals(bufferArray(i).getString("event")))) {
          //first page view of the session
          viewRk += 1
          bufferArray(i).put("view_rk", viewRk)
          //check whether this is the last event of the session
          if (i == bufferArray.length - 1) {
            bufferArray(i).put("step_dur", 0)
            bufferArray(i).put("isexit", 1)
          } else {
            lastViewIndex = i
          }
        } else if (viewRk != 0 && ("pageShow".equals(bufferArray(i).getString("event")) || "detailPageView".equals(bufferArray(i).getString("event")))) {
          //a later page view: close the previous view's dwell time
          viewRk += 1
          bufferArray(i).put("view_rk", viewRk)
          bufferArray(lastViewIndex).put("step_dur", bufferArray(i).getLongValue("timestamp") - bufferArray(lastViewIndex).getLongValue("timestamp"))
          if (i == bufferArray.length - 1) {
            bufferArray(i).put("step_dur", 0)
            bufferArray(i).put("isexit", 1)
          } else {
            lastViewIndex = i
          }
        } else {
          //non page-view event; if the session ends here the last view gets no dwell time
          if (i == bufferArray.length - 1) {
            bufferArray(lastViewIndex).put("step_dur", 0)
          }
        }
      }
      bufferArray
    }).map(_._2).flatMap(_.toList)
      .map(jsonobject => {
        val log: UserLog = JSON.parseObject(jsonobject.toString(), classOf[UserLog])
        log
      })
      .filter(userlog => {
        endTime.equals(userlog.dt.toString)
      })

    //write the result to ClickHouse

    val userLogDF: DataFrame = stpDurRDD.toDF()
    userLogDF.show()

    userLogDF
      .write
      .format("jdbc")
      .mode(SaveMode.Append)
      .option("driver", "ru.yandex.clickhouse.ClickHouseDriver")
      .option("url", "jdbc:clickhouse://ip:8123")
      .option("batchsize", "50000")
      .option("user", "ck")
      .option("password", "1234")
      .option("isolationLevel", "NONE")
      .option("dbtable","tablename" )
      .save()

    println(userLogDF.count())

    //stop the Spark session
    sparkSession.close()


  }
}
  case class UserLog(tenant_id:String,
                     login_id:String,
                     shop_id:String,
                     shop_group_1_id:String,
                     shop_group_2_id:String,
                     s_channel_id:String,
                     session_id:String,
                     properties:String,
                     distinct_id:String,
                     time:Timestamp,
                     isexit:Int,
                     view_rk:Int,
                     step_dur:Long,
                     timestamp:Long,
                     event:String,
                     dt:Date,
                    )

        The overall approach to session splitting: first group the events by user, then use the mapValues operator to sort each group by event time, and compare adjacent events pairwise; whenever the gap between two consecutive events exceeds the chosen interval (20 minutes in the code), a new sessionId is generated from the user id plus that event's timestamp, and all following events are tagged with this new sessionId until the next cut.
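A minimal stand-alone sketch of just this cut rule on toy data (illustration only, not part of the job above):

    val gapMs = 20 * 60 * 1000L
    val userId = "u_1"
    val eventTs = List(0L, 5 * 60 * 1000L, 40 * 60 * 1000L)   //e.g. 10:00, 10:05, 10:40

    var sessionId = s"${userId}_${eventTs.head}"
    var prevTs = eventTs.head
    val assigned = eventTs.map { ts =>
      if (ts - prevTs > gapMs) sessionId = s"${userId}_$ts"    //gap exceeded: new session
      prevTs = ts
      (ts, sessionId)
    }
    //assigned == List((0,"u_1_0"), (300000,"u_1_0"), (2400000,"u_1_2400000"))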

        On the difference between the JSONObject getLong and getLongValue methods used in the code, from the fastjson source:

    public Long getLong(String key) {
        Object value = get(key);

        return castToLong(value);
    }

    public long getLongValue(String key) {
        Object value = get(key);

        Long longVal = castToLong(value);
        if (longVal == null) {
            return 0L;
        }

        return longVal.longValue();
    }

Spark job submission

         The job is currently submitted to a Spark Standalone cluster.

bin/spark-submit \
--class com.test.UserLogETL \
--master spark://ip:7077 \
--executor-memory 2G \
--total-executor-cores 2 \
/home/suser/UserLogETL.jar \
"2021-12-29" "2021-12-28"

         Parameter reference:

--class: the class that contains the application's main method (here com.test.UserLogETL)

--master: the mode the Spark application runs in, e.g. local[*] for local mode, spark://hadoop102:7077 for Standalone, or yarn

--executor-memory 1G: memory available to each executor (1G in this example); pick a value that fits the cluster's memory configuration and the job at hand

--total-executor-cores 2: total number of CPU cores used across all executors (2 in this example)

application-jar: the packaged application jar including its dependencies; the URL must be visible to the whole cluster, e.g. an hdfs:// path on shared storage, or a file:// path that holds the same jar on every node

application-arguments: the arguments passed to the main() method

 

Appendix: splitting tracking data into sessions with SQL

Splitting sessions with SQL
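As a rough equivalent, the same cut can be expressed with Spark SQL window functions. A hedged sketch, assuming the raw events are registered as a temporary view named user_log with a distinct_id column and a millisecond timestamp column; here the session id is the user id plus a running session counter rather than the first event's timestamp:

    dataDf.createOrReplaceTempView("user_log")

    //lag() finds the gap to the previous event; a gap over 20 minutes marks a new session,
    //and a running sum of those markers numbers the sessions per user
    val sessionDf = sparkSession.sql(
      """
        |SELECT *,
        |       concat(distinct_id, '_',
        |              cast(sum(is_new_session) OVER (PARTITION BY distinct_id ORDER BY `timestamp`) AS string)) AS session_id
        |FROM (
        |  SELECT *,
        |         CASE WHEN `timestamp` - lag(`timestamp`, 1, `timestamp`)
        |                   OVER (PARTITION BY distinct_id ORDER BY `timestamp`) > 20 * 60 * 1000
        |              THEN 1 ELSE 0 END AS is_new_session
        |  FROM user_log
        |) t
        |""".stripMargin)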

Reference articles:

1. Spark reading and writing ClickHouse

2. Passing arguments to a Spark program and submitting jobs


If you're interested in big data or programming languages, you're also welcome to follow the 迪答 WeChat official account.
