1 Real-Time ETL
1.1 Background
Clean and split the log data produced by the algorithm:
•1: The algorithm's log records are nested JSON and need to be split and flattened.
•2: Convert the country field in each record to its region.
•3: Write the results back to Kafka.
1.2 Architecture
A video platform (similar to Douyin) merges multiple records into a single line when it generates logs.
1.3 Design
Log format:
A live-streaming platform (not a domestic one, but similar to China's Douyin)
Before processing:
{"dt":"2019-11-19 20:33:39","countryCode":"TW","data":[{"type":"s1","score":0.8,"level":"D"},{"type":"s2","score":0.1,"level":"B"}]}
Kafka:
How do we estimate storage? Roughly 1 billion records, each about 50 KB to a few hundred KB.
Our company has several hundred topics whose data looks like this, so many of our real-time jobs are exactly this kind of ETL.
After processing:
"dt":"2019-11-19 20:33:39","countryCode":"TW","type":"s1","score":0.8,"level":"D"
"dt":"2019-11-19 20:33:39","countryCode":"TW","type":"s2","score":0.1,"level":"B"
What we actually need to produce is:
"dt":"2019-11-19 20:33:39","area":"AREA_CT","type":"s1","score":0.8,"level":"D"
"dt":"2019-11-19 20:33:39","area":"AREA_CT","type":"s2","score":0.1,"level":"B"
The logs carry the location as a country code, so the ETL job also needs to translate it into a region along the way.
How would this be done with Spark Streaming?
1. Read the mapping data from Redis and register it as a broadcast variable.
2. Read the log data from Kafka.
3. flatMap over the stream, passing the broadcast variable in.
And how would it be done with Flink?
Flink reads Redis and builds a country-to-area (k, v) HashMap:
US,AREA_US
TW,AREA_CT
HK,AREA_CT
IN,AREA_IN
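To get that HashMap, the Redis hash is typically loaded once (for example with Jedis `hgetAll("areas")`) and inverted from region -> countries into country -> region for O(1) lookups per record. A minimal sketch of the inversion (plain Java, no Redis connection):

```java
import java.util.*;

public class AreaMapDemo {
    // The Redis hash "areas": region -> comma-separated country codes.
    static Map<String, String> sampleAreas() {
        Map<String, String> areas = new HashMap<>();
        areas.put("AREA_US", "US");
        areas.put("AREA_CT", "TW,HK");
        areas.put("AREA_AR", "PK,KW,SA");
        areas.put("AREA_IN", "IN");
        return areas;
    }

    // Invert {region -> "C1,C2,..."} into {country -> region}.
    static Map<String, String> toCountryMap(Map<String, String> areas) {
        Map<String, String> countryToArea = new HashMap<>();
        for (Map.Entry<String, String> e : areas.entrySet()) {
            for (String country : e.getValue().split(",")) {
                countryToArea.put(country, e.getKey());
            }
        }
        return countryToArea;
    }

    public static void main(String[] args) {
        System.out.println(toCountryMap(sampleAreas()).get("HK")); // AREA_CT
    }
}
```

Since the code table can change, the inverted map is usually refreshed periodically (for example from a background thread in an open-method-initialized rich function) rather than loaded only once.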
{"dt":"2019-11-19 20:33:41","countryCode":"KW","data":[{"type":"s2","score":0.2,"level":"A"},{"type":"s1","score":0.2,"level":"D"}]}
{"dt":"2019-11-19 20:33:43","countryCode":"HK","data":[{"type":"s5","score":0.5,"level":"C"},{"type":"s2","score":0.8,"level":"B"}]}
Redis code table format (metadata):
Region  Countries
hset areas AREA_US US
hset areas AREA_CT TW,HK
hset areas AREA_AR PK,KW,SA
hset areas AREA_IN IN
Operations:
HKEYS areas
HGETALL areas
2 Real-Time Reports
2.1 Background
Mainly statistics on review metrics for a live-streaming / short-video platform.
•1: Count, per region, the number of approved (shelved) videos in each 1-minute window (analogous to counting words).
Let's analyze:
We count per region, so region is the grouping field; "every minute" is the time dimension of the valid videos (processing time, or event time?).
Every minute [1: use event time; 2: plus a watermark, so we can rescue some late data; 3: collect data that arrives too late separately] count the valid videos per region (essentially a word count).
PM: product manager
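In Flink this becomes keyBy on the region plus a 1-minute event-time tumbling window with a watermark. The grouping itself can be sketched in plain Java as a batch analogue (helper names are hypothetical; watermark and late-data handling are omitted):

```java
import java.util.*;

public class MinuteWindowDemo {
    // Truncate "yyyy-MM-dd HH:mm:ss" to its minute bucket, "yyyy-MM-dd HH:mm".
    static String minuteBucket(String dt) {
        return dt.substring(0, 16);
    }

    // Count events per (minute bucket, region); each event is {dt, area}.
    // This is what a 1-minute tumbling window keyed by area would emit.
    static Map<String, Integer> countPerMinuteAndArea(List<String[]> events) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String[] ev : events) {
            counts.merge(minuteBucket(ev[0]) + "|" + ev[1], 1, Integer::sum);
        }
        return counts;
    }

    static List<String[]> sampleEvents() {
        return Arrays.asList(
            new String[]{"2019-11-20 15:09:44", "AREA_ID"},
            new String[]{"2019-11-20 15:09:46", "AREA_US"},
            new String[]{"2019-11-20 15:09:59", "AREA_ID"},
            new String[]{"2019-11-20 15:10:02", "AREA_ID"});
    }

    public static void main(String[] args) {
        System.out.println(countPerMinuteAndArea(sampleEvents()));
    }
}
```

The streaming version differs mainly in that the window fires when the watermark passes the window end, instead of after reading all the input.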
•2: Count, per region, the number of rejected (unshelved) videos in each 1-minute window.
Suppose our company is an e-commerce platform (like JD or Taobao):
JD -> shop owner -> lists a product -> passes review, goes on sale -> counts as a valid product
Per minute, the number of valid products per category:
[Clothes]
[Shoes]
[Books]
[Electronics]
Taobao -> shop owner -> lists a product -> fails review, taken down -> counts as an invalid product
2.2 Architecture
2.3 Design
Log format:
Suppose each qualifying record represents one valid video.
- Count the number of valid videos per region over the past minute.
- Count the number of valid videos per region and per type over the past minute.
This is the same shape as counting each word's occurrences over the past minute.
{"dt":"2019-11-20 15:09:43","type":"child_unshelf","username":"shenhe5","area":"AREA_ID"}
{"dt":"2019-11-20 15:09:44","type":"child_shelf","username":"shenhe2","area":"AREA_ID"}
{"dt":"2019-11-20 15:09:45","type":"black","username":"shenhe2","area":"AREA_US"}
{"dt":"2019-11-20 15:09:46","type":"child_shelf","username":"shenhe3","area":"AREA_US"}
{"dt":"2019-11-20 15:09:47","type":"unshelf","username":"shenhe3","area":"AREA_ID"}
{"dt":"2019-11-20 15:09:48","type":"black","username":"shenhe4","area":"AREA_IN"}
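Assuming the approved type is child_shelf, the per-region count over such log lines can be sketched in plain Java. The naive string extractor below stands in for a real JSON parser (such as fastjson) to keep the example dependency-free:

```java
import java.util.*;

public class ShelfCountDemo {
    // Naive extractor for one flat JSON log line (illustration only;
    // production code would use a JSON parser).
    static String field(String json, String name) {
        String marker = "\"" + name + "\":\"";
        int start = json.indexOf(marker) + marker.length();
        return json.substring(start, json.indexOf('"', start));
    }

    // Count approved videos (type == child_shelf) per region.
    static Map<String, Integer> approvedPerArea(List<String> logs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String log : logs) {
            if (field(log, "type").equals("child_shelf")) {
                counts.merge(field(log, "area"), 1, Integer::sum);
            }
        }
        return counts;
    }

    static List<String> sampleLogs() {
        return Arrays.asList(
            "{\"dt\":\"2019-11-20 15:09:44\",\"type\":\"child_shelf\",\"username\":\"shenhe2\",\"area\":\"AREA_ID\"}",
            "{\"dt\":\"2019-11-20 15:09:46\",\"type\":\"child_shelf\",\"username\":\"shenhe3\",\"area\":\"AREA_US\"}",
            "{\"dt\":\"2019-11-20 15:09:47\",\"type\":\"unshelf\",\"username\":\"shenhe3\",\"area\":\"AREA_ID\"}");
    }

    public static void main(String[] args) {
        System.out.println(approvedPerArea(sampleLogs())); // {AREA_ID=1, AREA_US=1}
    }
}
```

The unshelf record is filtered out, so only the two approved records are counted, one per region.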
pom.xml:
<properties>
<flink.version>1.9.0</flink.version>
<scala.version>2.11.8</scala.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-java</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-java_2.11</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-scala_2.11</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-scala_2.11</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.bahir</groupId>
<artifactId>flink-connector-redis_2.11</artifactId>
<version>1.0</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-statebackend-rocksdb_2.11</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-kafka-0.11_2.11</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>0.11.0.3</version>
</dependency>
<!-- logging dependencies -->
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.7.25</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.7.25</version>
</dependency>
<!-- Redis client -->
<dependency>
<groupId>redis.clients</groupId>
<artifactId>jedis</artifactId>
<version>2.9.0</version>
</dependency>
<!-- JSON -->
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.2.44</version>
</dependency>
<!-- Elasticsearch connector -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-elasticsearch6_2.11</artifactId>
<version>${flink.version}</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.1</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
<testExcludes>
<testExclude>/src/test/**</testExclude>
</testExcludes>
<encoding>utf-8</encoding>
</configuration>
</plugin>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.0</version>
<executions>
<execution>
<id>compile-scala</id>
<phase>compile</phase>
<goals>
<goal>add-source</goal>
<goal>compile</goal>
</goals>
</execution>
<execution>
<id>test-compile-scala</id>
<phase>test-compile</phase>
<goals>
<goal>add-source</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
<configuration>
<scalaVersion>${scala.version}</scalaVersion>
</configuration>
</plugin>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<id>make-assembly</id> <!-- this is used for inheritance merges -->
<phase>package</phase> <!-- merge dependencies into the jar during the package phase -->
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>