I. Business Requirements
Website user-behavior records are stored in MongoDB; the big-data stack is used to compute UV (unique visitor) statistics from them.
II. Implementation Plan
1. Tech stack
Hadoop CDH (cluster management) + MongoDB (source database) + Spark (computation and analysis) + MySQL (result storage)
2. Workflow
- Write and test the Spark SQL statistics code locally
- Package the finished code into a jar and upload it to the Hadoop cluster
- Test the code on the Hadoop cluster
- Set up a scheduled script that runs the statistics job once a day
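The last two steps above can be sketched as a submit script plus a crontab entry. This is only a deployment sketch: the paths, master URL and schedule below are placeholders, not values from the original post.

```shell
#!/bin/bash
# uv_stat.sh -- submit the UV job once; all paths here are hypothetical
SPARK_HOME=/opt/cloudera/parcels/SPARK2/lib/spark2   # assumed CDH Spark location
JAR=/data/jobs/datacenter.jar                        # assumed jar upload location

"$SPARK_HOME"/bin/spark-submit \
  --class com.qt.wisdomteaching.datacenter.log.UVStatJob \
  --master yarn \
  --deploy-mode cluster \
  "$JAR"

# crontab entry: run every day at 01:00, appending output to a log
# 0 1 * * * /data/jobs/uv_stat.sh >> /data/jobs/uv_stat.log 2>&1
```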
III. Technical Implementation
The Spark job is written in Scala.
1. Add the pom dependencies
<!-- Spark connector for MongoDB -->
<dependency>
    <groupId>org.mongodb.spark</groupId>
    <artifactId>mongo-spark-connector_2.11</artifactId>
    <version>2.3.1</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.3.1</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.3.1</version>
</dependency>
Note: the three artifacts must use the same version, and the same Scala binary version (`_2.11` here); mismatched versions cause classpath and compatibility problems at runtime.
2. Code
2.1 UVStatJob main class
package com.qt.wisdomteaching.datacenter.log

import com.mongodb.spark.MongoSpark
import com.qt.wisdomteaching.datacenter.dao.StatDAO
import com.qt.wisdomteaching.datacenter.entity.UVAccessStat
import org.apache.spark.sql.{DataFrame, SparkSession}

import scala.collection.mutable.ListBuffer

/**
 * @Auther: yaohongan
 * @Date: 2020/2/3 09:35
 * @Description: compute UV statistics from the MongoDB access log
 */
object UVStatJob {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("UVStatJob")
      .config("spark.mongodb.input.uri", "mongodb://test:B5rARkOh4d*P@127.0.0.1:27017/test.user")
      .getOrCreate()
    // reduce log noise
    spark.sparkContext.setLogLevel("WARN")

    val accessDF = MongoSpark.load(spark)
    accessDF.show()
    //accessDF.printSchema()

    // compute UV
    UVStat(spark, accessDF)

    spark.stop()
  }

  /**
   * Compute UV (distinct uid) per day and hour
   */
  def UVStat(spark: SparkSession, accessDF: DataFrame): Unit = {
    accessDF.createOrReplaceTempView("user")
    // UV counts each visitor once per bucket, so use count(distinct uid),
    // and group by the formatted day/hour rather than the raw timestamp
    val resDf = spark.sql(
      "select date_format(date,'yyyyMMdd') day, date_format(date,'HH') hour, " +
      "count(distinct uid) uv from user " +
      "group by date_format(date,'yyyyMMdd'), date_format(date,'HH')")
    resDf.show()
    //resDf.printSchema()

    // write the result to MySQL, one batch per partition
    try {
      resDf.foreachPartition(partitionOfRecords => {
        val list = new ListBuffer[UVAccessStat]
        partitionOfRecords.foreach(info => {
          val day = info.getAs[String]("day")
          val hour = info.getAs[String]("hour")
          val uv = info.getAs[Long]("uv")
          list.append(UVAccessStat(day, hour, uv))
        })
        StatDAO.insertUVAccessStat(list)
      })
    } catch {
      case e: Exception => e.printStackTrace()
    }
  }
}
Note: in the connection URI, `test` is the MongoDB username, `B5rARkOh4d*P` is the password, the `test` after the host is the database name, and `user` is the collection name.
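The aggregation in `UVStat` boils down to "distinct `uid` per (day, hour) bucket". A minimal plain-Scala sketch of that logic, on hypothetical in-memory records rather than the MongoDB collection, makes the distinct-count semantics easy to check:

```scala
object UVLogicSketch {
  // (day, hour, uid) -- stand-ins for the date_format/uid columns in the SQL
  case class Access(day: String, hour: String, uid: String)

  // UV per (day, hour): each visitor is counted once per bucket
  def uv(records: Seq[Access]): Map[(String, String), Long] =
    records.groupBy(r => (r.day, r.hour))
      .map { case (bucket, rs) => bucket -> rs.map(_.uid).distinct.size.toLong }

  def main(args: Array[String]): Unit = {
    val recs = Seq(
      Access("20200203", "09", "u1"),
      Access("20200203", "09", "u1"), // repeat visit, counted once
      Access("20200203", "09", "u2"),
      Access("20200203", "10", "u1"))
    println(uv(recs)) // bucket (20200203,09) has uv 2, bucket (20200203,10) has uv 1
  }
}
```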
2.2 UVAccessStat entity class (row mapping)
package com.qt.wisdomteaching.datacenter.entity

/**
 * @Auther: yaohongan
 * @Date: 2020/2/5 14:23
 * @Description: one UV statistics row: day, hour, and distinct-visitor count
 */
case class UVAccessStat(date: String, hour: String, uv: Long)
2.3 StatDAO database access class
package com.qt.wisdomteaching.datacenter.dao

import java.sql.{Connection, PreparedStatement}

import com.qt.wisdomteaching.datacenter.entity.UVAccessStat
import com.qt.wisdomteaching.datacenter.util.MySQLUtile

import scala.collection.mutable.ListBuffer

/**
 * @Auther: yaohongan
 * @Date: 2019/10/9 09:33
 * @Description: persist the cleaned statistics to MySQL
 */
object StatDAO {

  /**
   * Batch-insert the UV results into the database
   * @param list rows to insert
   */
  def insertUVAccessStat(list: ListBuffer[UVAccessStat]): Unit = {
    var connection: Connection = null
    var pstmt: PreparedStatement = null
    try {
      connection = MySQLUtile.getConnection()
      connection.setAutoCommit(false) // commit manually, once per batch
      val sql = "insert into stat_uv(day,uv,hour) values (?,?,?)"
      pstmt = connection.prepareStatement(sql)
      for (ele <- list) {
        pstmt.setString(1, ele.date)
        pstmt.setLong(2, ele.uv)
        pstmt.setString(3, ele.hour)
        pstmt.addBatch()
      }
      pstmt.executeBatch() // run the whole batch
      connection.commit()  // manual commit
    } catch {
      case e: Exception => e.printStackTrace()
    } finally {
      MySQLUtile.release(connection, pstmt)
    }
  }
}
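The `stat_uv` table itself is never defined in the post. A plausible sketch of its DDL, with column names taken from the insert statement above and the types assumed, is:

```sql
create table stat_uv (
  day  varchar(8) not null comment 'yyyyMMdd',
  hour varchar(2) not null comment 'HH',
  uv   bigint     not null comment 'distinct visitors',
  primary key (day, hour)
) engine=InnoDB default charset=utf8mb4;
```

With this primary key, re-running the daily job would fail on duplicate keys; changing the DAO's statement to `insert ... on duplicate key update uv = values(uv)` would make the scheduled re-runs idempotent.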
2.4 MySQL utility class
package com.qt.wisdomteaching.datacenter.util

import java.sql.{Connection, DriverManager, PreparedStatement}

/**
 * @Auther: yaohongan
 * @Date: 2019/3/18 14:46
 * @Description: MySQL connection helper
 */
object MySQLUtile {

  /**
   * Obtain a database connection
   */
  def getConnection(): Connection = {
    DriverManager.getConnection("jdbc:mysql://localhost:3306/test?user=admin&password=admin")
  }

  /**
   * Release the connection and related resources
   * @param connection
   * @param pstmt
   */
  def release(connection: Connection, pstmt: PreparedStatement): Unit = {
    try {
      if (pstmt != null) {
        pstmt.close()
      }
    } catch {
      case e: Exception => e.printStackTrace()
    } finally {
      if (connection != null) {
        connection.close()
      }
    }
  }

  // quick manual check of the connection
  def main(args: Array[String]): Unit = {
    println(getConnection())
  }
}
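The null-checked `release` above is the classic JDBC cleanup idiom. The same "always close, even on failure" guarantee can be expressed as a small loan-pattern helper; this is a sketch that is not part of the original code, and it works for any `AutoCloseable`, demonstrated here with a tiny fake resource instead of a real connection:

```scala
object Loan {
  // run f with the resource, always closing it afterwards
  def withResource[R <: AutoCloseable, A](resource: R)(f: R => A): A =
    try f(resource) finally resource.close()

  def main(args: Array[String]): Unit = {
    // fake resource that records whether close() was reached
    class Probe extends AutoCloseable {
      var closed = false
      override def close(): Unit = closed = true
    }
    val p = new Probe
    withResource(p)(_ => "ok")
    println(p.closed) // true
  }
}
```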
3. Results
3.1 Source data in MongoDB:
3.2 Computed UV statistics:
3.3 Result rows in MySQL: