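1. What does requirement set A contain?
Link to all requirements: Flink-Based Group User Profiling and Real-Time Data Statistics System for a Personal Dress-Up Store (2): Project Overview and Requirements
Requirement set A is defined over the simulated basic user information and includes:
- Group user profile: generation-label (birth decade) statistics
- Group user profile: mobile carrier preference
- Group user profile: email provider preference
Appendix: fields of the simulated basic user information
2. Generating the simulated basic user information
- Entity class for basic user information: UserBasicInfo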
package cn.edu.neu.bean;
import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
/**
* @author 32098
*/
@Data
@NoArgsConstructor
@AllArgsConstructor
public class UserBasicInfo {
private String userId;
private String username;
private String gender;
private String telephone;
private String email;
private int age;
private String registerTime;
}
- Generator class for the simulated basic user information: UserBasicInfoSource
package cn.edu.neu.source;
import cn.edu.neu.bean.UserBasicInfo;
import org.apache.commons.lang.RandomStringUtils;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;
import java.text.SimpleDateFormat;
import java.util.Arrays;
import java.util.Date;
import java.util.List;
import java.util.Random;
/**
* @author 32098
*
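* Simulates basic information for the system's users
*/
public class UserBasicInfoSource extends RichParallelSourceFunction<UserBasicInfo> {
// volatile: cancel() is called from a different thread than run()
private volatile boolean keepMockData;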
@Override
public void open(Configuration parameters) throws Exception {
super.open(parameters);
keepMockData = true;
}
private String getRandomDate(String beginDate, String endDate){
try {
SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");
Date start = format.parse(beginDate);
Date end = format.parse(endDate);
if(start.getTime() >= end.getTime()){
return null;
}
long date = start.getTime() + (long)(Math.random() * (end.getTime() - start.getTime()));
return format.format(date);
} catch (Exception e) {
e.printStackTrace();
}
return null;
}
@Override
public void run(SourceContext<UserBasicInfo> sourceContext) throws Exception {
// User ID
String uid;
// User name
String uName;
// User gender
List<String> uGenderList = Arrays.asList("man", "woman");
String uGender;
// User phone number
List<String> telFirstList = Arrays.asList(
("133,153,180,181,189,177,1700,173,199,130,131,132,155,156,185,186,145,176," +
"134,135,136,137,138,139,150,151,152,157,158,159,182,183,184,187,188,147,178").split(","));
String uTelephone;
// User email (no spaces in the list, otherwise the generated addresses would contain a space before "@")
List<String> emailLast = Arrays.asList(
"@163.com,@126.com,@139.com,@sohu.com,@qq.com,@189.cn,@aliyun.com,@sina.com".split(",")
);
String uEmail;
// User age
int uAge;
String registerTime;
Random r = new Random();
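// Draw the number of already-registered users once; putting r.nextInt(1000) in the loop condition would re-roll the bound on every iteration
int initialUsers = r.nextInt(1000);
for(int i=0; i<initialUsers; i++){
uid = RandomStringUtils.randomNumeric(4);
uName = RandomStringUtils.randomAlphabetic(6);
uGender = uGenderList.get(r.nextInt(2));
// Pad to 11 digits in total so that 4-digit prefixes such as 1700 still match the patterns in CarrierUtils
String telPrefix = telFirstList.get(r.nextInt(telFirstList.size()));
uTelephone = telPrefix + RandomStringUtils.randomNumeric(11 - telPrefix.length());
uEmail = RandomStringUtils.randomNumeric(6) + emailLast.get(r.nextInt(emailLast.size()));
// Math.abs keeps the Gaussian sample non-negative, as in the loop below
uAge = Math.abs((int) (r.nextGaussian()*10+25));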
registerTime = getRandomDate("2012-01-01", "2015-12-31");
sourceContext.collect(new UserBasicInfo(uid, uName, uGender, uTelephone, uEmail, uAge, registerTime));
}
// Simulate users who register later
SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");
while (keepMockData) {
uid = RandomStringUtils.randomNumeric(4);
uName = RandomStringUtils.randomAlphabetic(6);
uGender = uGenderList.get(r.nextInt(2));
// Pad to 11 digits in total so that 4-digit prefixes such as 1700 still match the patterns in CarrierUtils
String telPrefix = telFirstList.get(r.nextInt(telFirstList.size()));
uTelephone = telPrefix + RandomStringUtils.randomNumeric(11 - telPrefix.length());
uEmail = RandomStringUtils.randomNumeric(6) + emailLast.get(r.nextInt(emailLast.size()));
uAge = Math.abs((int) (r.nextGaussian()*12+25));
// registerTime = getRandomDate("2016-01-01", "2018-12-31");
registerTime = format.format(System.currentTimeMillis());
sourceContext.collect(new UserBasicInfo(uid, uName, uGender, uTelephone, uEmail, uAge, registerTime));
Thread.sleep((r.nextInt(60)+1)*1000);
}
}
@Override
public void cancel() {
keepMockData = false;
}
}
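In the code above, getRandomDate() returns a random date between its two arguments and is used to simulate when a user registered with the store. The overridden run() method first uses a for loop to generate the users who are already registered, drawing their ages from a Gaussian distribution; after the for loop, a while loop simulates users who register later.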
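The two ideas described above (a random date inside a range and a Gaussian-distributed age) can also be sketched with java.time and explicit clamping. This is only an illustration under assumptions: the class MockFieldSketch, its method names, and the age bounds 10 and 80 are hypothetical and are not part of the project.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.util.Random;

/** Hypothetical helper for illustration only; not part of the original project. */
class MockFieldSketch {

    /** Returns a random date in [begin, end), formatted as yyyy-MM-dd. */
    static String randomDate(Random r, LocalDate begin, LocalDate end) {
        long days = ChronoUnit.DAYS.between(begin, end);
        if (days <= 0) {
            throw new IllegalArgumentException("begin must be before end");
        }
        // Assumption: the range spans far fewer than Integer.MAX_VALUE days
        return begin.plusDays(r.nextInt((int) days)).toString();
    }

    /** Samples an age from a Gaussian with the given mean and standard deviation, clamped into [min, max]. */
    static int gaussianAge(Random r, double mean, double stdDev, int min, int max) {
        int age = (int) Math.round(r.nextGaussian() * stdDev + mean);
        return Math.max(min, Math.min(max, age));
    }

    public static void main(String[] args) {
        Random r = new Random();
        LocalDate begin = LocalDate.of(2012, 1, 1);
        LocalDate end = LocalDate.of(2015, 12, 31);
        for (int i = 0; i < 5; i++) {
            System.out.println(randomDate(r, begin, end) + " age=" + gaussianAge(r, 25, 10, 10, 80));
        }
    }
}
```

Clamping is one possible alternative to the Math.abs call in the source: it avoids negative ages without folding them back into the low positive range, at the cost of a small pile-up at the bounds.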
3. Implementing requirement set A
- Statistics data class: Statics.java
package cn.edu.neu.bean;
import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
/**
* @author 32098
*/
@AllArgsConstructor
@NoArgsConstructor
@Data
public class Statics {
private String staticsName;
private String staticsDetail;
private Long staticsData;
}
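Note: staticsName is the name of the statistic, e.g. yearBase for the generation-label statistics, carrier for mobile carrier preference, and email for email provider preference. staticsDetail is the category within that statistic (I could not think of a better word, so this will do for now), e.g. 60后, 70后, 80后, 90后, 00后, 10后 (born in the 1960s through the 2010s) for yearBase, or 移动 (China Mobile), 联通 (China Unicom), 电信 (China Telecom) for carrier. staticsData is the count for that category, e.g. 1200 users of the store are 90后.
- A note before implementing requirement set A
- As explained in the section on generating the simulated user information, the simulated records stand for users who have already registered with the store. Normally the Source would be written against the user table of the store's database (MySQL, HBase, etc.), which rules out duplicate user IDs. In the simulation code above, however, user IDs are random 4-digit strings, so once more than 10^4 users have been generated a duplicate ID is unavoidable (and collisions are likely to appear far earlier). The implementation of requirement set A below therefore looks flawed, because the same user can appear more than once (same user ID, different other fields). keyBy & reduce could of course fix this, but once enough users have been simulated, the part that simulates later registrations would become meaningless.
- We therefore leave this potential problem alone (that is, we ignore duplicate simulated user IDs) and assume that the data source is the user table of the store's database. Requirement set A is implemented under that assumption.
Ugh, what am I even saying? Feel free to skip this.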
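To make the duplicate-ID concern concrete, here is a small sketch that draws random 4-digit IDs until one repeats. The class DuplicateIdSketch and its method are hypothetical and only mirror the ID format used by the generator (String.format stands in for RandomStringUtils so the snippet has no dependencies). By the pigeonhole principle the first duplicate must appear within 10^4 + 1 draws; the birthday-paradox estimate sqrt(pi * 10^4 / 2), roughly 125 draws, suggests it usually appears much sooner.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

/** Hypothetical illustration of ID collisions in a 4-digit ID space. */
class DuplicateIdSketch {

    /** Draws random 4-digit IDs until one repeats and returns how many draws that took. */
    static int drawsUntilFirstDuplicate(Random r) {
        Set<String> seen = new HashSet<>();
        int draws = 0;
        while (true) {
            draws++;
            // Same shape as the generator's IDs: four digits, leading zeros kept
            String id = String.format("%04d", r.nextInt(10_000));
            if (!seen.add(id)) {
                return draws;
            }
        }
    }

    public static void main(String[] args) {
        Random r = new Random();
        int trials = 1000;
        long total = 0;
        for (int i = 0; i < trials; i++) {
            total += drawsUntilFirstDuplicate(r);
        }
        System.out.println("average draws until first duplicate: " + (double) total / trials);
    }
}
```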
- Group user profile, generation-label statistics: YearBaseTask.java
package cn.edu.neu.task.noneWindowTask;
import cn.edu.neu.bean.Statics;
import cn.edu.neu.bean.UserBasicInfo;
import cn.edu.neu.sink.StaticsSink;
import cn.edu.neu.source.UserBasicInfoSource;
import cn.edu.neu.util.DateUtils;
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
/**
*
* @author 32098
*
* Group user profile: generation labels
*/
public class YearBaseTask {
public static void main(String[] args) {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC);
DataStreamSource<UserBasicInfo> infoDs = env.addSource(new UserBasicInfoSource());
SingleOutputStreamOperator<Statics> resultDs = infoDs.map(new MapFunction<UserBasicInfo, Statics>() {
@Override
public Statics map(UserBasicInfo userBasicInfo) throws Exception {
int age = userBasicInfo.getAge();
String yearBaseType = DateUtils.getYearBaseByAge(age);
return new Statics("yearBase", yearBaseType, 1L);
}
}).keyBy(Statics::getStaticsDetail).reduce(
new ReduceFunction<Statics>() {
@Override
public Statics reduce(Statics staticsA, Statics staticsB) throws Exception {
String staticsName = staticsA.getStaticsName();
String staticsDetail = staticsA.getStaticsDetail();
Long data1 = staticsA.getStaticsData();
Long data2 = staticsB.getStaticsData();
return new Statics(staticsName, staticsDetail, data1+data2);
}
}
);
resultDs.addSink(new StaticsSink());
try {
env.execute();
} catch (Exception e) {
e.printStackTrace();
}
}
}
The implementation above only involves keyBy, reduce, and a sink, so it is easy to follow; the same holds for the next two requirements.
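For readers new to keyed reduce, the following plain-Java sketch imitates what keyBy(...).reduce(...) does with the counts when the job runs in streaming mode: for every incoming record it emits the updated running total for that record's key. It does not use Flink at all; RollingCountSketch and rollingCounts are hypothetical names, and the ASCII labels "90s" and "80s" stand in for the real staticsDetail values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical plain-Java imitation of a keyed rolling count; no Flink involved. */
class RollingCountSketch {

    /** For each input key, adds 1 to that key's running total and records the updated total. */
    static List<String> rollingCounts(List<String> details) {
        Map<String, Long> state = new HashMap<>();   // plays the role of per-key state
        List<String> emitted = new ArrayList<>();
        for (String detail : details) {
            long updated = state.merge(detail, 1L, Long::sum);
            emitted.add(detail + "=" + updated);
        }
        return emitted;
    }

    public static void main(String[] args) {
        // prints [90s=1, 80s=1, 90s=2, 90s=3]
        System.out.println(rollingCounts(Arrays.asList("90s", "80s", "90s", "90s")));
    }
}
```

Because every record produces a new running total rather than a final one, a sink that overwrites the stored value for the same key fits this output: the latest emitted total is always the current count.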
- Group user profile, mobile carrier preference: CarrierTask.java
package cn.edu.neu.task.noneWindowTask;
import cn.edu.neu.bean.Statics;
import cn.edu.neu.bean.UserBasicInfo;
import cn.edu.neu.sink.StaticsSink;
import cn.edu.neu.source.UserBasicInfoSource;
import cn.edu.neu.util.CarrierUtils;
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
/**
* @author 32098
*
* Group user profile: mobile carrier preference
*/
public class CarrierTask {
public static void main(String[] args) {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC);
DataStreamSource<UserBasicInfo> infoDs = env.addSource(new UserBasicInfoSource());
SingleOutputStreamOperator<Statics> resultDs = infoDs.map(new MapFunction<UserBasicInfo, Statics>() {
@Override
public Statics map(UserBasicInfo userBasicInfo) throws Exception {
String telephone = userBasicInfo.getTelephone();
String carrier = CarrierUtils.getCarrierByTel(telephone);
return new Statics("carrier", carrier, 1L);
}
}).keyBy(Statics::getStaticsDetail).reduce(
new ReduceFunction<Statics>() {
@Override
public Statics reduce(Statics staticsA, Statics staticsB) throws Exception {
String staticsName = staticsA.getStaticsName();
String staticsDetail = staticsA.getStaticsDetail();
Long data1 = staticsA.getStaticsData();
Long data2 = staticsB.getStaticsData();
return new Statics(staticsName, staticsDetail, data1+data2);
}
}
);
resultDs.addSink(new StaticsSink());
try {
env.execute();
} catch (Exception e) {
e.printStackTrace();
}
}
}
- Group user profile, email provider preference: EmailTask.java
package cn.edu.neu.task.noneWindowTask;
import cn.edu.neu.bean.Statics;
import cn.edu.neu.bean.UserBasicInfo;
import cn.edu.neu.sink.StaticsSink;
import cn.edu.neu.source.UserBasicInfoSource;
import cn.edu.neu.util.EmailUtils;
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
/**
*
* @author 32098
*
* Group user profile: email provider preference
*/
public class EmailTask {
public static void main(String[] args) {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC);
DataStreamSource<UserBasicInfo> infoDs = env.addSource(new UserBasicInfoSource());
SingleOutputStreamOperator<Statics> resultDs = infoDs.map(new MapFunction<UserBasicInfo, Statics>() {
@Override
public Statics map(UserBasicInfo userBasicInfo) throws Exception {
String email = userBasicInfo.getEmail();
String emailType = EmailUtils.getEmailtypeBy(email);
return new Statics("email", emailType, 1L);
}
}).keyBy(Statics::getStaticsDetail).reduce(
new ReduceFunction<Statics>() {
@Override
public Statics reduce(Statics staticsA, Statics staticsB) throws Exception {
String staticsName = staticsA.getStaticsName();
String staticsDetail = staticsA.getStaticsDetail();
Long data1 = staticsA.getStaticsData();
Long data2 = staticsB.getStaticsData();
return new Statics(staticsName, staticsDetail, data1+data2);
}
}
);
resultDs.addSink(new StaticsSink());
try {
env.execute();
} catch (Exception e) {
e.printStackTrace();
}
}
}
Appendix:
- Utility classes used by the implementations above
package cn.edu.neu.util;
import java.util.regex.Pattern;
/**
*
* @author 32098
*/
public class CarrierUtils {
/**
* China Telecom number format check. Prefixes: 133,153,180,181,189,177,1700,173,199
**/
private static final String CHINA_TELECOM_PATTERN = "(^1(33|53|77|73|99|8[019])\\d{8}$)|(^1700\\d{7}$)";
/**
* China Unicom number format check. Prefixes: 130,131,132,155,156,185,186,145,176,1709
**/
private static final String CHINA_UNICOM_PATTERN = "(^1(3[0-2]|4[5]|5[56]|7[6]|8[56])\\d{8}$)|(^1709\\d{7}$)";
/**
* China Mobile number format check.
* Prefixes: 134,135,136,137,138,139,150,151,152,157,158,159,182,183,184,187,188,147,178,1705
**/
private static final String CHINA_MOBILE_PATTERN = "(^1(3[4-9]|4[7]|5[0-27-9]|7[8]|8[2-478])\\d{8}$)|(^1705\\d{7}$)";
/**
* Returns the carrier name: 移动 (China Mobile), 联通 (China Unicom), 电信 (China Telecom), or 未知 (unknown)
* @param telephone phone number
* @return carrier name
*/
public static String getCarrierByTel(String telephone){
// Null or blank input cannot match any pattern
if (telephone == null || telephone.trim().isEmpty()) {
return "未知";
}
if (match(CHINA_MOBILE_PATTERN, telephone)) {
return "移动";
}
if (match(CHINA_UNICOM_PATTERN, telephone)) {
return "联通";
}
if (match(CHINA_TELECOM_PATTERN, telephone)) {
return "电信";
}
return "未知";
}
/**
* Regex match
* @param regex regex
* @param tel tel
* @return bool
*/
private static boolean match(String regex, String tel) {
return Pattern.matches(regex, tel);
}
}
package cn.edu.neu.util;
import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
/**
*
* @author 32098
*/
public class DateUtils {
public static String getYearBaseByAge(int age){
Calendar calendar = Calendar.getInstance();
calendar.setTime(new Date());
calendar.add(Calendar.YEAR, -age);
Date newDate = calendar.getTime();
DateFormat dateFormat = new SimpleDateFormat("yyyy");
String newDateString = dateFormat.format(newDate);
int newDateInteger = Integer.parseInt(newDateString);
String yearbasetype = "未知";
if(newDateInteger >= 1940 && newDateInteger < 1950){
yearbasetype = "40后";
}else if (newDateInteger >= 1950 && newDateInteger < 1960){
yearbasetype = "50后";
}else if (newDateInteger >= 1960 && newDateInteger < 1970){
yearbasetype = "60后";
}else if (newDateInteger >= 1970 && newDateInteger < 1980){
yearbasetype = "70后";
}else if (newDateInteger >= 1980 && newDateInteger < 1990){
yearbasetype = "80后";
}else if (newDateInteger >= 1990 && newDateInteger < 2000){
yearbasetype = "90后";
}else if (newDateInteger >= 2000 && newDateInteger < 2010){
yearbasetype = "00后";
}else if (newDateInteger >= 2010 ){
yearbasetype = "10后";
}
return yearbasetype;
}
}
package cn.edu.neu.util;
/**
*
* @author 32098
*/
public class EmailUtils {
/**
* @param email email
* @return emailType
*/
public static String getEmailtypeBy(String email){
// Default: other email providers
String emailType = "其他邮箱用户";
if(email.contains("@163.com")||email.contains("@126.com")){
emailType = "网易邮箱用户";
}else if (email.contains("@139.com")){
emailType = "移动邮箱用户";
}else if (email.contains("@sohu.com")){
emailType = "搜狐邮箱用户";
}else if (email.contains("@qq.com")){
emailType = "QQ邮箱用户";
}else if (email.contains("@189.cn")){
emailType = "189邮箱用户";
}else if (email.contains("@aliyun.com")){
emailType = "阿里邮箱用户";
}else if (email.contains("@sina.com")){
emailType = "新浪邮箱用户";
}
return emailType;
}
}
- Sink class used by the implementations above: StaticsSink
package cn.edu.neu.sink;
import cn.edu.neu.bean.Statics;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
/**
* @author 32098
*/
public class StaticsSink extends RichSinkFunction<Statics> {
private Connection conn = null;
private PreparedStatement ps = null;
public StaticsSink(){}
@Override
public void open(Configuration parameters) throws Exception {
conn = DriverManager.getConnection("jdbc:mysql://master:3306/user_portrait", "root", "Hive@2020");
String sql = "insert into statics(static_name, static_detail, static_data) values (?,?,?) on duplicate key update static_data=?";
ps = conn.prepareStatement(sql);
}
@Override
public void invoke(Statics statics, Context context) throws Exception {
ps.setString(1, statics.getStaticsName());
ps.setString(2, statics.getStaticsDetail());
ps.setLong(3, statics.getStaticsData());
ps.setLong(4, statics.getStaticsData());
ps.executeUpdate();
}
@Override
public void close() throws Exception {
// Close the statement before the connection that owns it
if (ps != null) {
ps.close();
}
if (conn != null) {
conn.close();
}
}
}
- MySQL table:
CREATE TABLE `user_portrait`.`statics` (
`static_name` varchar(25) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
`static_detail` varchar(25) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
`static_data` bigint(12) NULL DEFAULT NULL,
PRIMARY KEY (`static_name`, `static_detail`) USING BTREE
) ENGINE = InnoDB CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = Dynamic;
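Next post: Flink-Based Group User Profiling and Real-Time Data Statistics System for a Personal Dress-Up Store (4): Implementing Requirement Set B
Previous post: Flink-Based Group User Profiling and Real-Time Data Statistics System for a Personal Dress-Up Store (3): Experiment Environment and Project Structure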