*E-commerce Project in Practice*
The project is written in Scala; a download link for the data used in the project is provided.
The data consists of e-commerce user click logs, with fields separated by "_". A few sample records:
2019-07-17_95_26070e87-1ad7-49a3-8fb3-cc741facaddf_37_2019-07-17 00:00:02_手机_-1_-1_null_null_null_null_3
2019-07-17_95_26070e87-1ad7-49a3-8fb3-cc741facaddf_48_2019-07-17 00:00:10_null_16_98_null_null_null_null_19
2019-07-17_95_26070e87-1ad7-49a3-8fb3-cc741facaddf_6_2019-07-17 00:00:17_null_19_85_null_null_null_null_7
2019-07-17_38_6502cdc9-cf95-4b08-8854-f03a25baa917_29_2019-07-17 00:00:19_null_12_36_null_null_null_null_5
2019-07-17_38_6502cdc9-cf95-4b08-8854-f03a25baa917_22_2019-07-17 00:00:28_null_-1_-1_null_null_15,1,20,6,4_15,88,75_9
2019-07-17_38_6502cdc9-cf95-4b08-8854-f03a25baa917_11_2019-07-17 00:00:29_苹果_-1_-1_null_null_null_null_7
2019-07-17_38_6502cdc9-cf95-4b08-8854-f03a25baa917_24_2019-07-17 00:00:38_null_-1_-1_15,13,5,11,8_99,2_null_null_10
2019-07-17_38_6502cdc9-cf95-4b08-8854-f03a25baa917_24_2019-07-17 00:00:48_null_19_44_null_null_null_null_4
2019-07-17_38_6502cdc9-cf95-4b08-8854-f03a25baa917_47_2019-07-17 00:00:54_null_14_79_null_null_null_null_2
2019-07-17_38_6502cdc9-cf95-4b08-8854-f03a25baa917_27_2019-07-17 00:00:59_null_3_50_null_null_null_null_26
2019-07-17_38_6502cdc9-cf95-4b08-8854-f03a25baa917_27_2019-07-17 00:01:05_i7_-1_-1_null_null_null_null_17
2019-07-17_38_6502cdc9-cf95-4b08-8854-f03a25baa917_24_2019-07-17 00:01:07_null_5_39_null_null_null_null_10
2019-07-17_38_6502cdc9-cf95-4b08-8854-f03a25baa917_25_2019-07-17 00:01:13_i7_-1_-1_null_null_null_null_24
2019-07-17_38_6502cdc9-cf95-4b08-8854-f03a25baa917_22_2019-07-17 00:01:21_null_19_62_null_null_null_null_20
*Record format: date_userID_sessionID_pageID_timestamp_searchKeyword_clickCategoryID_clickProductID_orderCategoryID_orderProductID_payCategoryID_payProductID_cityID*
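For orientation, splitting one of the sample records on the underscore yields the fields above. Indices are zero-based, which matters later when the code reads datas(6) and datas(8):

```scala
// Split one sample record on "_"; the timestamp contains a space but no
// underscore, so it survives the split as a single field.
val line = "2019-07-17_95_26070e87-1ad7-49a3-8fb3-cc741facaddf_37_2019-07-17 00:00:02_手机_-1_-1_null_null_null_null_3"
val datas = line.split("_")
// datas(0) = "2019-07-17"  (date)
// datas(5) = "手机"        (search keyword)
// datas(6) = "-1"          (click category ID; -1 means this record is not a click)
```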
Before implementing the requirements, let's first sketch a small framework that abstracts out the repeated operations.
Start by importing the dependency JARs. Don't worry about whether each one is needed yet; they all will be eventually. Just be careful not to duplicate entries, since some were already imported for the earlier WordCount example.
<dependency>
<groupId>io.netty</groupId>
<artifactId>netty-all</artifactId>
<version>4.1.17.Final</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>RELEASE</version>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-core</artifactId>
<version>2.8.2</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.7.5</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.7.5</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>2.7.5</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>2.4.6</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>2.4.6</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.12</artifactId>
<version>2.4.6</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
<version>2.4.6</version>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>druid</artifactId>
<version>1.1.10</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.12.8</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-reflect</artifactId>
<version>2.12.8</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-compiler</artifactId>
<version>2.12.8</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.12</artifactId>
<version>2.4.6</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>2.3.3</version>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.38</version>
</dependency>
The project uses a three-tier architecture (Dao-Service-Controller).
Under the scala source directory, create a package to hold the abstract framework, tentatively named core. Inside it, create four traits: TApplication, TController, TService, and TDao.
TApplication is the program's main entry point. TController is the controller, responsible for invoking the computation logic. TService contains the actual computation logic. TDao handles data access; since the demo classes have no database, the Dao layer is mostly idle here.
TApplication -> abstracts out the repeated opening and closing of the Spark connection. The start function wraps the actual business logic, which is passed in as the by-name parameter op.
trait TApplication {
  var envData: Any = null
  // open the connection
  def start(t: String = "jdbc")(op: => Unit): Unit = {
    if (t == "spark") {
      envData = EnvUtil.getEnv()
    }
    // run the abstracted business logic
    try {
      op
    } catch {
      case ex: Exception => println(ex.getMessage)
    }
    // close the connection
    if (t == "spark") {
      EnvUtil.clear()
    }
  }
}
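EnvUtil is referenced here but not shown in this excerpt. A minimal sketch of what it might look like, assuming a SparkContext cached in a ThreadLocal (a common pattern for this kind of framework; the master and app name below are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical sketch: caches the SparkContext in a ThreadLocal so that
// Dao/Service code can reach it without threading it through every call.
object EnvUtil {
  private val scLocal = new ThreadLocal[SparkContext]()

  def getEnv(): SparkContext = {
    var sc = scLocal.get()
    if (sc == null) {
      val conf = new SparkConf().setMaster("local[*]").setAppName("HotCategory")
      sc = new SparkContext(conf)
      scLocal.set(sc)
    }
    sc
  }

  def clear(): Unit = {
    val sc = scLocal.get()
    if (sc != null) {
      sc.stop()       // shut down the SparkContext
      scLocal.remove()
    }
  }
}
```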
TController -> contains only one abstract function, excute; each concrete Controller extends the trait and overrides it.
trait TController {
  def excute(): Unit
}
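For context, here is a hedged sketch of how an application object might wire these traits together (HotCateGoryApplication is an assumed name following the pattern described above; the controller class appears under requirement 1 below):

```scala
// Hypothetical wiring: the application mixes in TApplication and runs
// the controller inside start(), which opens and closes the Spark env.
object HotCateGoryApplication extends App with TApplication {
  start("spark") {
    val controller = new HotCateGoryController()
    controller.excute()
  }
}
```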
TService -> has a single analysis function; override it to implement the actual data-processing logic.
trait TService {
/**
* 数据分析
* @return
*/
def analysis():Any
}
TDao -> readFile is a method every requirement will use, so it lives in the base trait. Anything requirement-specific is added by extending TDao.
import org.apache.spark.rdd.RDD

trait TDao {
  def readFile(path: String): RDD[String] = {
    EnvUtil.getEnv().textFile(path)
  }
}
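The concrete Dao used in requirement 1 (referenced later as HotCateGoryDao) then needs no body of its own; a plausible definition is simply:

```scala
// Reuses readFile from TDao; nothing requirement-specific needed yet.
class HotCateGoryDao extends TDao
```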
*Requirement 1: find the most popular categories, ranked by clicks first, then orders, then payments*
Top ten hot categories: HotCateGory
*HotCateGoryController*
-> Nothing much to say here; it is just the entry point for the computation logic, a plain controller.
class HotCateGoryController extends TController {
  private val hotCateGoryService = new HotCateGoryService
  override def excute(): Unit = {
    val result = hotCateGoryService.analysis()
    result.foreach(println)
  }
}
*HotCateGoryService*
-> The actual logic. The idea is to count clicks, orders, and payments separately, each time producing an RDD keyed by category ID with the click/order/payment count as the value, then merge and flatten the three RDDs, sort, and take the top ten. It is essentially a slightly more elaborate WordCount.
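The merge-and-sort step described above (combining the three per-metric RDDs and taking the top ten) is not reached in this excerpt's code. A sketch, assuming the three count RDDs have already been computed, with illustrative names:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical sketch of the final merge: turn each metric RDD into
// (category, (click, order, pay)) tuples, union them, sum component-wise,
// then sort descending by the tuple and take the top ten.
def top10(clickRDD: RDD[(String, Int)],
          orderRDD: RDD[(String, Int)],
          payRDD:   RDD[(String, Int)]): Array[(String, (Int, Int, Int))] = {
  val clicks = clickRDD.map { case (id, c) => (id, (c, 0, 0)) }
  val orders = orderRDD.map { case (id, c) => (id, (0, c, 0)) }
  val pays   = payRDD.map   { case (id, c) => (id, (0, 0, c)) }
  clicks.union(orders).union(pays)
    .reduceByKey { case ((c1, o1, p1), (c2, o2, p2)) => (c1 + c2, o1 + o2, p1 + p2) }
    .sortBy(_._2, ascending = false) // tuple ordering: clicks, then orders, then pays
    .take(10)
}
```

Sorting on the whole (Int, Int, Int) tuple gives exactly the required tie-breaking: orders decide ties in clicks, and payments decide ties in orders.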
/**
 * Data analysis
 * Record format: date_userID_sessionID_pageID_timestamp_searchKeyword_clickCategoryID_clickProductID_orderCategoryID_orderProductID_payCategoryID_payProductID_cityID
 */
class HotCateGoryService extends TService {
  private val hotCateGoryDao = new HotCateGoryDao()
  override def analysis(): Array[(String, (Int, Int, Int))] = {
    // TODO read the e-commerce logs
    val actionRDD: RDD[String] = hotCateGoryDao.readFile("input/user_visit_action.txt")
    // cache the data to avoid wastefully re-reading the file
    actionRDD.cache()
    // TODO count clicks
    val clickRDD = actionRDD.map(action => {
      val datas = action.split("_")
      // field index 6 (zero-based) is the click category ID
      (datas(6), 1)
    }).filter(_._1 != "-1") // drop -1, i.e. records that are not clicks
    val cateGoryIdToClickCountRDD: RDD[(String, Int)] = clickRDD.reduceByKey(_ + _)
    // TODO count orders
    /**
     * An order can contain several categories, so (category, order count) becomes:
     * (category1,category2,category3, 1)
     */
    val orderRDD = actionRDD.map(action => {
      val datas = action.split("_")
      // field index 8 (zero-based) is the order category ID list
      datas(8)
    }).filter(_ != "null") // drop "null", i.e. records that are not orders
    // flatten (category1,category2,category3, 1) -> (category1,1)(category2,1)(category3,1)
    val orderToOneRDD = orderRDD.flatMap {
      id => {
        val ids = id.split(",")
        ids.map(id => (id, 1))
      }
    }
    val cateGoryIdToOrderCountRDD: RDD[(String, Int)] = orderToOneRDD.reduceByKey(_ + _)
    // TODO count payments
val payRDD = actionRDD.map