This post records the requirements analysis, technical design, and base-data design for the per-area hot-products statistic.
Requirements Analysis
- For a user-specified date range, compute the top 3 hottest products in each area.
- The Spark job receives a taskid, looks up the corresponding task in MySQL to obtain the user's filter parameters, computes the top-3 hot products per area within the specified date range, and finally writes the result into a MySQL table (entry-point sketch below).
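As a rough sketch of that entry point, the job can fetch its parameters over plain JDBC before starting any Spark work. This is a minimal illustration only: the `task` table and `task_param` column names, as well as the connection settings, are assumptions, not necessarily the project's actual schema.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class TaskDao {

    /** Returns the task's filter parameters (a JSON string), or null if no such task. */
    public static String findTaskParam(long taskid) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");  // Connector/J 5.x style registration
        String url = "jdbc:mysql://localhost:3306/spark_project"
                + "?useUnicode=true&characterEncoding=utf8";
        try (Connection conn = DriverManager.getConnection(url, "root", "password");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT task_param FROM task WHERE task_id = ?")) {
            ps.setLong(1, taskid);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString("task_param") : null;
            }
        }
    }
}
```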
Technical Design
- Query the task to get the date range, then use Spark SQL to pull the specified date range out of the user_visit_action table, keeping only product-click actions (click_product_id is not null, click_product_id != 'NULL', click_product_id != 'null') and only the city_id and click_product_id columns (first sketch after this list).
- Use Spark SQL to load the city info (city_id, city_name, area) from MySQL and join the user-behavior data against it, yielding city_id, city_name, area, product_id; convert the RDD to a Dataset and register it as a temporary table.
- Use Spark SQL's built-in CASE WHEN to tag each area with a level ('华东' → A级, '华中' → B级, '东北' → C级, '西北' → D级), producing an area_level column.
- Compute each product's click count per area with GROUP BY area, product_id, while keeping each area's list of city names via a custom UDAF, group_concat_distinct(), which aggregates a city_names field; columns: area, product_id, city_names, click_count (UDAF sketch below).
- Join against the product detail table in Hive (product_id, product_name, extend_info). extend_info is a JSON string, so a custom UDF, get_json_object(), extracts its product_status field, and Spark SQL's built-in if() decodes it (0 = 自营 self-operated, 1 = 第三方 third-party); columns: area, product_id, city_names, click_count, product_name, product_status (UDF sketch below).
- Use a window function partitioned by area to take the top 3 products by click_count in each area; columns: area, area_level, product_id, city_names, click_count, product_name, product_status (row_number() sketch below).
- Write the result into a MySQL table (JDBC write sketch below).
- Spark SQL data-skew countermeasures: double group-by with a random key plus table expansion (a custom random_key() UDF), converting Spark SQL's reduce join into a map join, and raising the shuffle parallelism (configuration sketch below).
- Test locally, then in the production environment.
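The sketches below walk through these steps one at a time; all of them are minimal illustrations rather than the project's actual code. First, steps 1 through 3 as one Spark SQL statement, run against the mock tables registered by MockData (shown at the end of this post). The city_info rows are hard-coded here so the sketch runs locally; the JDBC read further down shows the production variant.

```java
package graduation.java.test;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class AreaTop3ProductSketch {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("AreaTop3ProductSketch").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Registers the user_visit_action and product_info temp tables.
        MockData.mock(sc, sqlContext);

        // A few hard-coded city_info rows keep the sketch self-contained;
        // in production this table is read from MySQL over JDBC instead.
        List<Row> cityRows = Arrays.asList(
                RowFactory.create(0L, "北京", "华北"), RowFactory.create(1L, "上海", "华东"),
                RowFactory.create(2L, "南京", "华东"), RowFactory.create(5L, "武汉", "华中"),
                RowFactory.create(7L, "西安", "西北"), RowFactory.create(9L, "哈尔滨", "东北"));
        StructType citySchema = DataTypes.createStructType(Arrays.asList(
                DataTypes.createStructField("city_id", DataTypes.LongType, true),
                DataTypes.createStructField("city_name", DataTypes.StringType, true),
                DataTypes.createStructField("area", DataTypes.StringType, true)));
        sqlContext.createDataFrame(sc.parallelize(cityRows), citySchema)
                .registerTempTable("city_info");

        // Placeholders: in the real job these come from the task's parameters.
        // Set them to today's date to match the mock data.
        String startDate = "2019-03-01";
        String endDate = "2019-03-01";

        // Steps 1-3: keep product clicks in the date range, join the city info,
        // and tag each area with a level via CASE WHEN.
        // (When click_product_id is a string column in Hive, also filter the
        // literal 'NULL'/'null' values, as the design above calls for.)
        Dataset<Row> areaClicks = sqlContext.sql(
                "SELECT ci.area, "
              + "CASE WHEN ci.area = '华东' THEN 'A级' WHEN ci.area = '华中' THEN 'B级' "
              + "WHEN ci.area = '东北' THEN 'C级' WHEN ci.area = '西北' THEN 'D级' "
              + "ELSE 'E级' END AS area_level, "
              + "ci.city_name, uva.click_product_id AS product_id "
              + "FROM user_visit_action uva JOIN city_info ci ON uva.city_id = ci.city_id "
              + "WHERE uva.click_product_id IS NOT NULL "
              + "AND uva.dates >= '" + startDate + "' AND uva.dates <= '" + endDate + "'");
        areaClicks.registerTempTable("tmp_area_click");
        areaClicks.show();
    }
}
```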
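Next, the group_concat_distinct() UDAF from step 4, sketched with Spark's UserDefinedAggregateFunction API. Note the contains()-based de-duplication is simplistic (a city name that is a substring of another would be dropped); it is good enough for mock data.

```java
package graduation.java.test;

import java.util.Arrays;

import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.MutableAggregationBuffer;
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

/** Concatenates distinct city names into one comma-separated string per group. */
public class GroupConcatDistinctUDAF extends UserDefinedAggregateFunction {

    @Override
    public StructType inputSchema() {
        return DataTypes.createStructType(Arrays.asList(
                DataTypes.createStructField("city_name", DataTypes.StringType, true)));
    }

    @Override
    public StructType bufferSchema() {
        return DataTypes.createStructType(Arrays.asList(
                DataTypes.createStructField("buffer_city_names", DataTypes.StringType, true)));
    }

    @Override
    public DataType dataType() { return DataTypes.StringType; }

    @Override
    public boolean deterministic() { return true; }

    @Override
    public void initialize(MutableAggregationBuffer buffer) { buffer.update(0, ""); }

    // Fold one input row into the buffer, skipping names already present.
    @Override
    public void update(MutableAggregationBuffer buffer, Row input) {
        String names = buffer.getString(0);
        String name = input.getString(0);
        if (!names.contains(name)) {
            buffer.update(0, "".equals(names) ? name : names + "," + name);
        }
    }

    // Merge two partial buffers from different partitions, de-duplicating.
    @Override
    public void merge(MutableAggregationBuffer buffer1, Row buffer2) {
        String names1 = buffer1.getString(0);
        for (String name : buffer2.getString(0).split(",")) {
            if (!"".equals(name) && !names1.contains(name)) {
                names1 = "".equals(names1) ? name : names1 + "," + name;
            }
        }
        buffer1.update(0, names1);
    }

    @Override
    public Object evaluate(Row buffer) { return buffer.getString(0); }
}
```

Registered and applied, continuing the main() sketch above:

```java
sqlContext.udf().register("group_concat_distinct", new GroupConcatDistinctUDAF());

Dataset<Row> areaProductClicks = sqlContext.sql(
        "SELECT area, area_level, product_id, count(*) AS click_count, "
      + "group_concat_distinct(city_name) AS city_names "
      + "FROM tmp_area_click "
      + "GROUP BY area, area_level, product_id");
areaProductClicks.registerTempTable("tmp_area_product_click");
```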
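Step 5's JSON extraction. Spark SQL has actually shipped a built-in get_json_object() since 1.6, so the custom UDF is only needed on older versions (or as an exercise). A hand-rolled sketch might look like the following; the org.json dependency is my choice for illustration, not necessarily the project's.

```java
// needs: org.apache.spark.sql.api.java.UDF2; org.json.JSONObject (assumed dependency)
sqlContext.udf().register("get_json_object",
        (UDF2<String, String, String>) (json, field) -> {
            try {
                return String.valueOf(new org.json.JSONObject(json).get(field));
            } catch (Exception e) {
                return null;  // malformed JSON or missing field
            }
        }, DataTypes.StringType);

// Join the product details and decode product_status with the built-in if().
Dataset<Row> fullRows = sqlContext.sql(
        "SELECT t.area, t.area_level, t.product_id, t.city_names, t.click_count, "
      + "p.product_name, "
      + "if(get_json_object(p.extend_info, 'product_status') = '0', '自营', '第三方') "
      + "AS product_status "
      + "FROM tmp_area_product_click t "
      + "JOIN product_info p ON t.product_id = p.product_id");
fullRows.registerTempTable("tmp_area_fullprod_click");
```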
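Step 6's top 3 per area via ROW_NUMBER(). On Spark 1.x window functions require a HiveContext; on 2.x the plain SQLContext parses this fine.

```java
Dataset<Row> top3 = sqlContext.sql(
        "SELECT area, area_level, product_id, city_names, click_count, "
      + "product_name, product_status "
      + "FROM (SELECT *, "
      + "ROW_NUMBER() OVER (PARTITION BY area ORDER BY click_count DESC) AS row_num "
      + "FROM tmp_area_fullprod_click) t "
      + "WHERE row_num <= 3");
top3.show();
```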
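Writing the result (step 7). One simple option is the DataFrame JDBC writer; the connection settings are placeholders, and a task_id column is tagged on so that several task runs can share the area_top3_product table (DDL below).

```java
// needs: org.apache.spark.sql.SaveMode; org.apache.spark.sql.functions.lit
long taskid = 1L;  // placeholder: the taskid this job was launched with
java.util.Properties props = new java.util.Properties();
props.put("user", "root");
props.put("password", "password");

top3.withColumn("task_id", org.apache.spark.sql.functions.lit(taskid))
        .write()
        .mode(org.apache.spark.sql.SaveMode.Append)
        .jdbc("jdbc:mysql://localhost:3306/spark_project", "area_top3_product", props);
```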
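Finally, the skew countermeasures from the last design point, sketched as configuration plus a hypothetical random_key() UDF (the 10-bucket prefix is an arbitrary choice):

```java
// Raise shuffle parallelism (default 200) and let small tables such as city_info
// be broadcast, turning the reduce join into a map join.
sqlContext.setConf("spark.sql.shuffle.partitions", "1000");
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "20971520");  // 20 MB, in bytes

// needs: org.apache.spark.sql.api.java.UDF1
// For the double group-by: prefix hot keys with a random bucket number, aggregate,
// strip the prefix, then aggregate again.
sqlContext.udf().register("random_key",
        (UDF1<String, String>) key -> new java.util.Random().nextInt(10) + "_" + key,
        DataTypes.StringType);
```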
Base Data Preparation and Design
1. MySQL table city_info: city_id, city_name, area
```sql
/* SQLyog dump - MySQL 5.5.20, database spark_project */
CREATE DATABASE IF NOT EXISTS `spark_project` DEFAULT CHARACTER SET utf8;
USE `spark_project`;

DROP TABLE IF EXISTS `city_info`;
CREATE TABLE `city_info` (
  `city_id`   int(11)      DEFAULT NULL,
  `city_name` varchar(255) DEFAULT NULL,
  `area`      varchar(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

INSERT INTO `city_info` (`city_id`, `city_name`, `area`) VALUES
  (0, '北京', '华北'),
  (1, '上海', '华东'),
  (2, '南京', '华东'),
  (3, '广州', '华南'),
  (4, '三亚', '华南'),
  (5, '武汉', '华中'),
  (6, '长沙', '华中'),
  (7, '西安', '西北'),
  (8, '成都', '西南'),
  (9, '哈尔滨', '东北');
```
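In the Spark job this table can be loaded with the JDBC data source and registered as the city_info temp table used in the sketches above; the URL and credentials here are placeholders.

```java
java.util.Properties props = new java.util.Properties();
props.put("user", "root");
props.put("password", "password");

Dataset<Row> cityInfo = sqlContext.read()
        .jdbc("jdbc:mysql://localhost:3306/spark_project", "city_info", props);
cityInfo.registerTempTable("city_info");
```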
2. MySQL table area_top3_product: task_id, area, area_level, product_id, city_names, click_count, product_name, product_status
```sql
/* SQLyog dump - MySQL 5.5.20, database spark_project */
USE `spark_project`;

DROP TABLE IF EXISTS `area_top3_product`;
CREATE TABLE `area_top3_product` (
  `task_id`        int(11)      DEFAULT NULL,
  `area`           varchar(255) DEFAULT NULL,
  `area_level`     varchar(255) DEFAULT NULL,
  `product_id`     int(11)      DEFAULT NULL,
  `city_names`     varchar(255) DEFAULT NULL,
  `click_count`    int(11)      DEFAULT NULL,
  `product_name`   varchar(255) DEFAULT NULL,
  `product_status` varchar(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

-- no seed data: the Spark job fills this table
```
3. Hive needs a product_info table: product_id, product_name, extend_info
For now this table is mocked in code; for production testing it is created in Hive (HiveContext sketch below).
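For the production run, a HiveContext lets the same SQL hit the real Hive table. A hedged sketch of that side, with column types inferred from the mock schema:

```java
// needs: org.apache.spark.sql.hive.HiveContext (spark-hive dependency)
HiveContext hiveContext = new HiveContext(sc.sc());
hiveContext.sql(
        "CREATE TABLE IF NOT EXISTS product_info ("
      + "product_id BIGINT, product_name STRING, extend_info STRING)");
hiveContext.sql("SELECT * FROM product_info LIMIT 10").show();
```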
MockData.java
```java
package graduation.java.test;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.UUID;

import graduation.java.util.DateUtils;
import graduation.java.util.StringUtils;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

/**
 * FileName: MockData
 * Author: hadoop
 * Email: 3165845957@qq.com
 * Date: 19-3-1 10:30
 * Description: generates mock user visit data.
 */
public class MockData {

    /**
     * Mock the data.
     * @param sc
     * @param sqlContext
     */
    public static void mock(JavaSparkContext sc, SQLContext sqlContext) {
        List<Row> rows = new ArrayList<Row>();

        String[] searchKeywords = new String[]{"火锅", "蛋糕", "重庆辣子鸡", "重庆小面",
                "呷哺呷哺", "新辣道鱼火锅", "国贸大厦", "太古商场", "日本料理", "温泉"};
        String dates = DateUtils.getTodayDate();
        String[] actions = new String[]{"search", "click", "order", "pay"};
        Random random = new Random();

        // Mock user_visit_action: 100 users, 10 sessions each, up to 100 actions per session.
        for (int i = 0; i < 100; i++) {
            long userid = random.nextInt(100);
            for (int j = 0; j < 10; j++) {
                String sessionid = UUID.randomUUID().toString().replace("-", "");
                String baseActionTime = dates + " " + random.nextInt(23);
                for (int k = 0; k < random.nextInt(100); k++) {
                    long pageid = random.nextInt(10);
                    String actionTime = baseActionTime + ":"
                            + StringUtils.fulfuill(String.valueOf(random.nextInt(59))) + ":"
                            + StringUtils.fulfuill(String.valueOf(random.nextInt(59)));
                    // Long.MAX_VALUE marks "no value"; only the fields matching the
                    // chosen action get filled in below.
                    String searchKeyword = null;
                    Long clickCategoryId = Long.MAX_VALUE;
                    Long clickProductId = Long.MAX_VALUE;
                    String orderCategoryIds = null;
                    String orderProductIds = null;
                    String payCategoryIds = null;
                    String payProductIds = null;

                    String action = actions[random.nextInt(4)];
                    if ("search".equals(action)) {
                        searchKeyword = searchKeywords[random.nextInt(10)];
                    } else if ("click".equals(action)) {
                        clickCategoryId = Long.valueOf(String.valueOf(random.nextInt(100)));
                        clickProductId = Long.valueOf(String.valueOf(random.nextInt(100)));
                    } else if ("order".equals(action)) {
                        orderCategoryIds = String.valueOf(random.nextInt(100));
                        orderProductIds = String.valueOf(random.nextInt(100));
                    } else if ("pay".equals(action)) {
                        payCategoryIds = String.valueOf(random.nextInt(100));
                        payProductIds = String.valueOf(random.nextInt(100));
                    }

                    Row row = RowFactory.create(dates, userid, sessionid, pageid,
                            actionTime, searchKeyword, clickCategoryId, clickProductId,
                            orderCategoryIds, orderProductIds, payCategoryIds, payProductIds,
                            Long.valueOf(String.valueOf(random.nextInt(10))));
                    rows.add(row);
                }
            }
        }

        JavaRDD<Row> rowsRDD = sc.parallelize(rows);

        StructType schema = DataTypes.createStructType(Arrays.asList(
                DataTypes.createStructField("dates", DataTypes.StringType, true),
                DataTypes.createStructField("user_id", DataTypes.LongType, true),
                DataTypes.createStructField("session_id", DataTypes.StringType, true),
                DataTypes.createStructField("page_id", DataTypes.LongType, true),
                DataTypes.createStructField("action_time", DataTypes.StringType, true),
                DataTypes.createStructField("search_keyword", DataTypes.StringType, true),
                DataTypes.createStructField("click_category_id", DataTypes.LongType, true),
                DataTypes.createStructField("click_product_id", DataTypes.LongType, true),
                DataTypes.createStructField("order_category_ids", DataTypes.StringType, true),
                DataTypes.createStructField("order_product_ids", DataTypes.StringType, true),
                DataTypes.createStructField("pay_category_ids", DataTypes.StringType, true),
                DataTypes.createStructField("pay_product_ids", DataTypes.StringType, true),
                DataTypes.createStructField("city_id", DataTypes.LongType, true)));

        Dataset<Row> df = sqlContext.createDataFrame(rowsRDD, schema);
        // registerTempTable is deprecated in Spark 2.x in favour of createOrReplaceTempView.
        df.registerTempTable("user_visit_action");
        df.show();

        // Mock user_info: 100 users with random attributes.
        rows.clear();
        String[] sexes = new String[]{"male", "female"};
        for (int i = 0; i < 100; i++) {
            long userid = i;
            String username = "user" + i;
            String name = "name" + i;
            int age = random.nextInt(60);
            String professional = "professional" + random.nextInt(100);
            String city = "city" + random.nextInt(100);
            String sex = sexes[random.nextInt(2)];
            Row row = RowFactory.create(userid, username, name, age, professional, city, sex);
            rows.add(row);
        }

        rowsRDD = sc.parallelize(rows);

        StructType schema2 = DataTypes.createStructType(Arrays.asList(
                DataTypes.createStructField("user_id", DataTypes.LongType, true),
                DataTypes.createStructField("username", DataTypes.StringType, true),
                DataTypes.createStructField("name", DataTypes.StringType, true),
                DataTypes.createStructField("age", DataTypes.IntegerType, true),
                DataTypes.createStructField("professional", DataTypes.StringType, true),
                DataTypes.createStructField("city", DataTypes.StringType, true),
                DataTypes.createStructField("sex", DataTypes.StringType, true)));

        Dataset<Row> df2 = sqlContext.createDataFrame(rowsRDD, schema2);
        df2.show();
        df2.registerTempTable("user_info");

        // Mock product_info: 100 products; extend_info is a JSON string holding
        // a random product_status of 0 or 1.
        rows.clear();
        int[] productStatus = new int[]{0, 1};
        for (int i = 0; i < 100; i++) {
            long productId = i;
            String productName = "product" + i;
            String extendInfo = "{\"product_status\": "
                    + productStatus[random.nextInt(2)] + "}";
            Row row = RowFactory.create(productId, productName, extendInfo);
            rows.add(row);
        }

        rowsRDD = sc.parallelize(rows);

        StructType schema3 = DataTypes.createStructType(Arrays.asList(
                DataTypes.createStructField("product_id", DataTypes.LongType, true),
                DataTypes.createStructField("product_name", DataTypes.StringType, true),
                DataTypes.createStructField("extend_info", DataTypes.StringType, true)));

        Dataset<Row> df3 = sqlContext.createDataFrame(rowsRDD, schema3);
        df3.show();
        df3.registerTempTable("product_info");
    }
}
```