Log data collection: Flume writes the logs to HDFS, partitioned by day, one folder per day.
Spark: processes and transforms the log files; the output lands back on HDFS, just in a different folder, and the processed files are mapped to the ODS-layer tables of the data warehouse.
or
Hive + UDF: processes and transforms the log files; the output lands back on HDFS, just in a different folder, and the processed files are mapped to the ODS-layer tables of the data warehouse.
Spark preprocessing: flattening; use the fastjson library to turn the event field into a Map and promote every other JSON attribute to a top-level column.
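The flattening step above can be sketched language-neutrally. The snippet below uses Python's json module in place of fastjson, and the sample record layout is only an assumed illustration of the log format:

```python
import json

def flatten(record: dict, prefix: str = "") -> dict:
    """Recursively flatten nested JSON objects into dotted column names,
    keeping the 'event' field as a map (per the rule above)."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if key == "event":                 # event stays a map, not flattened
            flat[name] = value
        elif isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat

# Hypothetical log line, shaped like the fields referenced later ($.user.phone.mac)
line = '{"eventid":"pageView","user":{"phone":{"mac":"aa:bb"}},"event":{"url":"/home"}}'
row = flatten(json.loads(line))
# row: {"eventid": "pageView", "user.phone.mac": "aa:bb", "event": {"url": "/home"}}
```

In the real job the same logic runs inside a Spark map over the raw lines, producing one flat row per record.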
Business rules:
Drop records in which uid|uuid|mac|imei|androidId|imsi are all empty.
Drop records missing any key field: eventid, sessionId, and event are all required.
Use a broadcast variable to ship the latitude/longitude reference data to the executors.
Look up the province/city/district with Map.get.
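A minimal sketch of the broadcast-lookup idea: the driver builds a small dictionary from the geo dimension table, Spark broadcasts it, and each executor resolves the region with a plain Map.get. The keys (geohash prefixes) and sample values are assumptions for illustration:

```python
# Would be built from the geomap code table and wrapped in sc.broadcast(...)
geo_map = {
    "wx4g0": ("Beijing", "Beijing", "Chaoyang"),
    "wtw3s": ("Shanghai", "Shanghai", "Pudong"),
}

def resolve_area(geohash_code: str, broadcast_map: dict):
    """Map.get with a default, mirroring broadcastVar.value.get(...) in Spark."""
    return broadcast_map.get(geohash_code[:5], ("unknown", "unknown", "unknown"))
```

Broadcasting avoids shipping the whole dimension table with every task; each executor caches one read-only copy.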
Hive preprocessing: map the raw file to a single-column table (line → json), then flatten:
get_json_object(line, '$.eventid') as eventid,
str_to_map(regexp_replace(get_json_object(line, '$.event'), '\\{|\\}|\\"', '')) as event,
get_json_object(line, '$.user.phone.mac') as mac
Business rules: filter with WHERE conditions.
Latitude/longitude conversion: join against the geo dimension table using a geohash-encoding UDF.
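The geohash-encoding UDF referred to above interleaves longitude and latitude bits and base32-encodes them. A plain-Python sketch of the standard algorithm (the real UDF would wrap the ch.hsr geohash library from the pom below):

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat: float, lng: float, precision: int = 5) -> str:
    """Encode a lat/lng pair by bisecting the ranges, interleaving
    longitude and latitude bits, and emitting base32 every 5 bits."""
    lat_rng, lng_rng = [-90.0, 90.0], [-180.0, 180.0]
    code, bits, ch, even = [], 0, 0, True
    while len(code) < precision:
        rng, value = (lng_rng, lng) if even else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            ch = ch * 2 + 1
            rng[0] = mid
        else:
            ch = ch * 2
            rng[1] = mid
        even = not even
        bits += 1
        if bits == 5:
            code.append(BASE32[ch])
            bits, ch = 0, 0
    return "".join(code)
```

Records and the geo dimension table are then joined on this code, so nearby coordinates (sharing a prefix) hit the same row.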
Log preprocessing:
Cleaning and filtering:
1. Remove the deprecated JSON fields: email, phoneNbr, gender, isLogin, addr, isRegister.
2. Drop records in which uid|uuid|mac|imei|androidId|imsi are all empty.
3. Drop records missing any key field: eventid, sessionId, and event are all required.
4. Drop records whose JSON is malformed.
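The four cleaning rules above can be sketched as a single filter function. The placement of the identifier fields under a user object is an assumption about the log layout:

```python
import json

KEY_FIELDS = ("eventid", "sessionId", "event")
ID_FIELDS = ("uid", "uuid", "mac", "imei", "androidId", "imsi")
DEPRECATED = ("email", "phoneNbr", "gender", "isLogin", "addr", "isRegister")

def clean(line: str):
    """Apply rules 1-4: return the cleaned dict, or None if the record is dropped."""
    try:
        rec = json.loads(line)                     # rule 4: malformed JSON
    except ValueError:
        return None
    user = rec.get("user", {})
    if not any(user.get(f) for f in ID_FIELDS):    # rule 2: all identifiers empty
        return None
    if not all(rec.get(f) for f in KEY_FIELDS):    # rule 3: any key field missing
        return None
    for f in DEPRECATED:                           # rule 1: drop deprecated fields
        user.pop(f, None)
    return rec
```

In Spark this would sit inside a flatMap (emitting zero or one row per line); in Hive the same rules become WHERE predicates.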
Parsing:
Flatten the JSON; the event field is kept as-is and does not need flattening.
Integration:
1. Resolve latitude/longitude into province, city, and district (county), to support analysis by region.
2. Map IP addresses to locations.
3. Integrate business-district data.
Correction:
1. Backfill guid.
2. Normalize field names.
3. Normalize field units: timestamps uniformly in seconds.
4. Normalize field types: timestamps uniformly as long integers.
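The unit, type, and name corrections might look like the sketch below; the millisecond-detection threshold and the rename mapping are illustrative assumptions, not the project's actual rules:

```python
def normalize_timestamp(ts) -> int:
    """Rules 3-4: unify the unit (seconds) and the type (integer).
    Heuristic: values >= 10**12 are assumed to be milliseconds."""
    ts = int(float(ts))
    if ts >= 10**12:
        ts //= 1000
    return ts

# Rule 2: raw field names -> canonical warehouse names (hypothetical mapping)
FIELD_RENAMES = {"eventid": "event_id", "sessionId": "session_id"}

def normalize_fields(rec: dict) -> dict:
    return {FIELD_RENAMES.get(k, k): v for k, v in rec.items()}
```

Applying these in the preprocessing job keeps every downstream ODS table on one timestamp convention and one naming convention.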
JSON flattening:
Log file format: JSON with nested fields; nested structures map to map-typed columns in the table.
Example:
pom.xml file:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.example</groupId>
<artifactId>LogTest</artifactId>
<version>1.0-SNAPSHOT</version>
<name>scala-demo-project</name>
<!-- FIXME change it to the project's website -->
<url>http://www.example.com</url>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
</properties>
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.11.8</version>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka_2.11</artifactId>
<version>0.11.0.2</version>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-dbcp2</artifactId>
<version>2.1.1</version>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.47</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.2.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.2.0</version>
<!--<scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>2.2.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.scalikejdbc/scalikejdbc -->
<dependency>
<groupId>org.scalikejdbc</groupId>
<artifactId>scalikejdbc_2.11</artifactId>
<version>3.1.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.scalikejdbc/scalikejdbc-config -->
<dependency>
<groupId>org.scalikejdbc</groupId>
<artifactId>scalikejdbc-config_2.11</artifactId>
<version>3.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.11</artifactId>
<version>2.2.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-graphx_2.11</artifactId>
<version>2.2.0</version>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.2.69</version>
</dependency>
<dependency>
<groupId>ch.hsr</groupId>
<artifactId>geohash</artifactId>
<version>1.3.0</version>
</dependency>
<dependency>
<groupId>org.mongodb.spark</groupId>
<artifactId>mongo-spark-connector_2.11</artifactId>
<version>2.2.0</version>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>druid</artifactId>
<version>1.1.10</version>
</dependency>
<dependency>
<groupId>redis.clients</groupId>
<artifactId>jedis</artifactId>
<version>2.9.3</version>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-api-scala_2.11</artifactId>
<version>11.0</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.2</version>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.4.3</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<transformers>
<transformer
implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass>cn.kgc.kafak.demo.ThreadProducer</mainClass>
</transformer>
</transformers>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
Data integration:
1. Connect to MySQL and create the geo code table geomap.
geomap.sql:
/*
SQLyog Ultimate v12.09 (64 bit)
MySQL - 5.6.50
*********************************************************************
*/
/*!40101 SET NAMES utf8 */;
create table `geomap` (
`lag` float ,
`lat` float ,
`province` varchar (120),
`city` varchar (120),
`district` varchar (120)
);
insert into `geomap` (`lag`, `lat`,