Environment versions:
Hadoop: 3.0.0-CDH6.2.1
Spark: 2.4.0-CDH6.2.1
HBase: 2.1.0-CDH6.2.1
Phoenix: 5.0.0-cdh6.2.0.p0.1308267
Basic architecture:
Why record user access behavior logs?
- 1. To measure page views on the site
- 2. To measure site stickiness: repeat clicks and link clicks from the web or app client
Contents of a user behavior log:
- Client type
- Module / app ID
- Target link URL
- Visitor IP address
- Visitor account
- Access time and region
The significance of user behavior log analysis:
- 1. The eyes of the website
- 2. The nerves of the website
- 3. The brain of the website
Generating web access data with Python
Developing a real-time log generator in Python
The Python script below was written in Sublime Text.
#coding=UTF-8
import random
import time

url_paths = [
    "class/112.html",
    "class/128.html",
    "class/145.html",
    "class/146.html",
    "class/131.html",
    "class/130.html",
    "learn/821",
    "course/list"
]

ip_slices = [132,156,124,10,29,167,143,187,30,46,55,63,72,87,98]

http_referers = [
    "https://www.baidu.com/s?wd={query}",
    "https://www.sogou.com/web?query={query}",
    "https://cn.bing.com/search?q={query}",
    "https://search.yahoo.com/search?p={query}",
]

search_keyword = [
    "Spark SQL实战",
    "Hadoop基础",
    "Storm实战",
    "Spark Streaming实战",
    "大数据面试"
]

# HTTP status codes to sample from
status_codes = ["200","404","500"]

# sample a url
def sample_url():
    return random.sample(url_paths, 1)[0]

# sample an ip: pick four octets from the list and join them
def sample_ip():
    slice = random.sample(ip_slices, 4)
    return ".".join([str(item) for item in slice])

# sample a referer (about 80% of records get none)
def sample_referer():
    if random.uniform(0, 1) > 0.2:
        return "-"
    refer_str = random.sample(http_referers, 1)
    query_str = random.sample(search_keyword, 1)
    return refer_str[0].format(query=query_str[0])

# sample a status code
def sample_status_code():
    return random.sample(status_codes, 1)[0]

# generate log lines into access.log on the server
def generate_log(count=10):
    time_str = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
    # "w+" truncates the file on every run; the Flume exec source below tails it
    with open("/usr/whl/streamingproject/access.log", "w+") as f:
        while count >= 1:
            query_log = "{ip}\t{localtime}\t\"GET /{url} HTTP/1.1\"\t{status}\t{referer}".format(
                url=sample_url(), ip=sample_ip(), referer=sample_referer(),
                status=sample_status_code(), localtime=time_str)
            print(query_log)
            f.write(query_log + "\n")
            count = count - 1

if __name__ == '__main__':
    generate_log(100)
Set up a scheduled job that runs the generator once a minute using Linux crontab.
A handy expression tester: https://tool.lu/crontab
Crontab expression: * */1 * * *
The minute field is *, so this fires once every minute (the */1 in the hour field is redundant; * * * * * is equivalent).
Run crontab -e and add the following line:
* */1 * * * /usr/whl/streamingproject/log.sh
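log.sh itself is not shown above; a minimal sketch, assuming the generator was saved as generate_log.py in the same directory:
#!/bin/bash
python /usr/whl/streamingproject/generate_log.py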
Sample log output:
10.87.187.46 2019-12-04 16:46:01 "GET /class/130.html HTTP/1.1" 404 -
187.156.46.143 2019-12-04 16:46:01 "GET /course/list HTTP/1.1" 500 -
63.167.46.55 2019-12-04 16:46:01 "GET /class/145.html HTTP/1.1" 500 -
46.10.156.72 2019-12-04 16:46:01 "GET /course/list HTTP/1.1" 404 -
187.72.124.87 2019-12-04 16:46:01 "GET /class/131.html HTTP/1.1" 200 -
187.46.143.72 2019-12-04 16:46:01 "GET /class/146.html HTTP/1.1" 404 -
10.55.63.46 2019-12-04 16:46:01 "GET /class/112.html HTTP/1.1" 404 https://www.sogou.com/web?query=Spark SQL实战
63.132.55.187 2019-12-04 16:46:01 "GET /class/145.html HTTP/1.1" 200 -
46.63.156.72 2019-12-04 16:46:01 "GET /class/128.html HTTP/1.1" 404 https://www.sogou.com/web?query=Hadoop基础
156.72.167.63 2019-12-04 16:46:01 "GET /class/131.html HTTP/1.1" 500 -
55.63.72.46 2019-12-04 16:46:01 "GET /class/146.html HTTP/1.1" 404 -
55.29.98.46 2019-12-04 16:46:01 "GET /class/146.html HTTP/1.1" 200 -
87.167.124.29 2019-12-04 16:46:01 "GET /class/130.html HTTP/1.1" 404
Real-time processing architecture
Route: Flume & Kafka & Spark Streaming
Flume application
![Flume architecture](https://img-blog.csdnimg.cn/20191216175941873.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzQxNzk0Mjg1,size_16,color_FFFFFF,t_70)
Flume is a highly available, highly reliable distributed system from Cloudera for collecting, aggregating, and transporting massive amounts of log data. Its streaming architecture is simple and flexible.
Application
Wire the logs produced by the Python log generator into Flume.
Component choices for access.log ==> console output:
- Source: exec
- Channel: memory
- Sink: logger
The agent configuration is below (file name: streaming.conf):
exec-memory-logger.sources = exec-source
exec-memory-logger.sinks = logger-sink
exec-memory-logger.channels = memory-channel
exec-memory-logger.sources.exec-source.type = exec
exec-memory-logger.sources.exec-source.command = tail -f /usr/whl/streamingproject/access.log
exec-memory-logger.sources.exec-source.shell = /bin/sh -c
exec-memory-logger.channels.memory-channel.type = memory
exec-memory-logger.sinks.logger-sink.type = logger
exec-memory-logger.sources.exec-source.channels = memory-channel
exec-memory-logger.sinks.logger-sink.channel = memory-channel
Command to start the agent:
flume-ng agent \
--name exec-memory-logger \
--conf $FLUME_HOME/conf \
--conf-file /usr/whl/streamingproject/streaming.conf \
-Dflume.root.logger=INFO,console
If it prints messages like the following:
19/12/04 17:16:05 INFO sink.LoggerSink: Event: { headers:{} body: 32 39 2E 31 35 36 2E 31 32 34 2E 36 33 09 32 30 29.156.124.63.20 }
19/12/04 17:16:05 INFO sink.LoggerSink: Event: { headers:{} body: 31 32 34 2E 35 35 2E 31 30 2E 31 35 36 09 32 30 124.55.10.156.20 }
19/12/04 17:16:05 INFO sink.LoggerSink: Event: { headers:{} body: 31 30 2E 31 33 32 2E 37 32 2E 34 36 09 32 30 31 10.132.72.46.201 }
19/12/04 17:16:05 INFO sink.LoggerSink: Event: { headers:{} body: 31 30 2E 36 33 2E 31 33 32 2E 31 32 34 09 32 30 10.63.132.124.20 }
19/12/04 17:16:05 INFO sink.LoggerSink: Event: { headers:{} body: 35 35 2E 31 33 32 2E 38 37 2E 31 32 34 09 32 30 55.132.87.124.20 }
19/12/04 17:16:05 INFO sink.LoggerSink: Event: { headers:{} body: 31 32 34 2E 31 30 2E 33 30 2E 35 35 09 32 30 31 124.10.30.55.201 }
then Flume is collecting the access logs to the console in real time.
Kafka application
Overview
Kafka is a distributed publish/subscribe messaging system and stream-processing platform:
- 1. Publish and subscribe to streams of records
- 2. Store streams of records with good fault tolerance
- 3. Process streams of records as they occur (Kafka Streams)
Use cases:
- 1. Building real-time streaming data pipelines that reliably move data between systems and applications
- 2. Building real-time streaming applications
From /opt/cloudera/parcels/KAFKA-4.1.0-1.4.1.0.p0.4/bin, start the Kafka server in the background:
./kafka-server-start -daemon $KAFKA_HOME/config/server.properties
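If the topic does not exist yet, create it first (partition and replication counts below are illustrative for a single-node setup):
kafka-topics --create --zookeeper 172.20.0.207:2181 --replication-factor 1 --partitions 1 --topic streamingtopic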
Modify the Flume configuration to write to Kafka instead. File name: streaming_kafka.conf
exec-memory-kafka.sources = exec-source
exec-memory-kafka.sinks = kafka-sink
exec-memory-kafka.channels = memory-channel
exec-memory-kafka.channels.memory-channel.type = memory
exec-memory-kafka.channels.memory-channel.byteCapacity = 800000
exec-memory-kafka.sources.exec-source.type = exec
exec-memory-kafka.sources.exec-source.command = tail -f /usr/whl/streamingproject/access.log
exec-memory-kafka.sources.exec-source.shell = /bin/sh -c
exec-memory-kafka.sinks.kafka-sink.type = org.apache.flume.sink.kafka.KafkaSink
exec-memory-kafka.sinks.kafka-sink.brokerList = 172.20.0.207:9092
exec-memory-kafka.sinks.kafka-sink.topic = streamingtopic
exec-memory-kafka.sinks.kafka-sink.batchSize = 5
exec-memory-kafka.sinks.kafka-sink.requiredAcks = 1
exec-memory-kafka.sources.exec-source.channels = memory-channel
exec-memory-kafka.sinks.kafka-sink.channel = memory-channel
First start a Kafka console consumer:
kafka-console-consumer --bootstrap-server 172.20.0.207:9092 --topic streamingtopic
Then start the Flume agent:
flume-ng agent \
--name exec-memory-kafka \
--conf $FLUME_HOME/conf \
--conf-file /usr/whl/streamingproject/streaming_kafka.conf \
-Dflume.root.logger=INFO,console
Command to reset topic offsets in Kafka (this example uses the Kafka Streams application-reset tool, with hostnames from a different sandbox cluster):
bin/kafka-streams-application-reset.sh --zookeeper sandbox-hdp.hortonworks.com:2181 --bootstrap-servers sandbox-hdp.hortonworks.com:6667 --application-id it21learning-event-attendees-streamming --input-topics event_attendees_raw
A problem I ran into:
org.apache.flume.ChannelFullException: Space for commit to queue couldn't be acquired. Sinks are likely not keeping up with sources, or the buffer size is too tight
The fix is simply to raise the channel's byteCapacity:
exec-memory-kafka.channels.memory-channel.byteCapacity = 800000
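byteCapacity caps the channel's memory in bytes; the event-count knobs of the memory channel are worth tuning alongside it (the values below are illustrative, not from the original setup):
exec-memory-kafka.channels.memory-channel.capacity = 10000
exec-memory-kafka.channels.memory-channel.transactionCapacity = 1000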
For details see:
https://blog.csdn.net/gaopu12345/article/details/77922924
Spark application
IDEA project, Scala version 2.11.8.
The pom.xml dependencies are listed below.
The Scala suffix of each Spark artifact (_2.11) must match the project's Scala version, otherwise the project will not run.
<dependencies>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.11</version>
        <scope>test</scope>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.4.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.4.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.11</artifactId>
        <version>2.4.0</version>
        <scope>compile</scope>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-8 -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
        <version>2.4.0</version>
    </dependency>
</dependencies>
Spark Streaming can consume Kafka data in two ways:
- 1. Receiver-based approach
This approach uses a receiver to pull data out of Kafka, built on Kafka's high-level consumer API. As with all receivers, the data fetched from Kafka is stored on Spark executors, and the jobs submitted by Spark Streaming then process that data.
- 2. Direct approach
Spark 1.3 introduced the Direct approach. Unlike the Receiver approach there is no receiver layer: Spark periodically queries Kafka for the latest offsets of each partition of each topic, then processes each batch according to the configured maxRatePerPartition.
This example uses the first approach (the high-level API).
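For reference, the Direct approach against the same spark-streaming-kafka-0-8 artifact would look roughly like this (a sketch: ssc is the StreamingContext created below, and the broker address is the one from the Flume sink config):
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Direct approach: no receiver; offsets are tracked by Spark per batch
val kafkaParams = Map[String, String]("metadata.broker.list" -> "172.20.0.207:9092")
val topicsSet = Set("streamingtopic")
val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topicsSet)
directStream.map(_._2).count().print()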
Add the Scala SDK to the project.
Package structure:
package project

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object StreamingAppDemo {
  def main(args: Array[String]): Unit = {
    if (args.length != 4) {
      println("Usage: Streamdemo <zkQuorum> <group> <topics> <numThreads>")
      System.exit(1)
    }
    val Array(zkQuorum, groupId, topics, numThreads) = args
    /**
      * For local testing
      */
    val sparkConf = new SparkConf().setAppName("Streamdemo").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(60))
    val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
    /**
      * Receiver-based KafkaUtils: ZooKeeper tracks the offsets automatically,
      * so there is no need to manage them by hand
      */
    val messages = KafkaUtils.createStream(ssc, zkQuorum, groupId, topicMap)
    /**
      * Test step 1: verify data is received (a simple count per batch)
      */
    messages.map(_._2).count().print()
    ssc.start()
    ssc.awaitTermination()
  }
}
Run configuration: pass <zkQuorum> <group> <topics> <numThreads> as program arguments, for example (the consumer group name is your own choice):
172.20.0.207:2181 streaming-group streamingtopic 1
If each 60-second batch prints a record count, the data is being consumed successfully.
Cleaning the data with Spark Streaming
Data cleaning: extract the fields we need from the raw logs.
Package structure:
1. The date-parsing utility class
package project.utils

import java.util.Date
import org.apache.commons.lang3.time.FastDateFormat

/**
  * Date/time utility class
  */
object DateUtils {
  // source format (as written by the log generator)
  val YYYYMMDDHHMMSS_FORMAT = FastDateFormat.getInstance("yyyy-MM-dd HH:mm:ss")
  // target format
  val TARGET_FORMAT = FastDateFormat.getInstance("yyyyMMddHHmmss")

  // parse the raw timestamp into epoch milliseconds
  def getTime(time: String) = {
    YYYYMMDDHHMMSS_FORMAT.parse(time).getTime
  }

  // convert to the target format
  def parseToMinute(time: String) = {
    TARGET_FORMAT.format(new Date(getTime(time)))
  }

  def main(args: Array[String]): Unit = {
    // quick test of the conversion
    println(parseToMinute("2019-12-05 17:03:01")) // prints 20191205170301
  }
}
2. Case class (entity) ClickLog
package project.domain

/**
  * Cleaned log record, defined as a case class
  * @param ip         visitor ip
  * @param time       access time
  * @param courseId   id of the course page that was hit
  * @param statusCode HTTP status code of the request
  * @param referer    referer of the request
  */
case class ClickLog(ip: String, time: String, courseId: Int, statusCode: Int, referer: String)
3. StreamingAppDemo
Before running this class, make sure the Flume agent is up and shipping data to Kafka.
package project.spark

import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import project.domain.ClickLog
import project.utils.DateUtils

object StreamingAppDemo {
  def main(args: Array[String]): Unit = {
    if (args.length != 4) {
      println("Usage: Streamdemo <zkQuorum> <group> <topics> <numThreads>")
      System.exit(1)
    }
    val Array(zkQuorum, groupId, topics, numThreads) = args
    /**
      * For local testing
      */
    val sparkConf = new SparkConf().setAppName("Streamdemo").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(60))
    val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
    /**
      * Receiver-based KafkaUtils: ZooKeeper tracks the offsets automatically
      */
    val messages = KafkaUtils.createStream(ssc, zkQuorum, groupId, topicMap)
    /**
      * Test step 1: verify data is received
      */
    // messages.map(_._2).count().print()
    /**
      * Test step 2: clean the data
      */
    val logs = messages.map(_._2)
    val cleanData = logs.map(line => {
      // e.g. "GET /class/112.html HTTP/1.1"
      val infos = line.split("\t")
      // e.g. /class/112.html HTTP/1.1
      val url = infos(2).split(" ")(1)
      var courseId = 0
      // extract the course id
      if (url.startsWith("/class")) {
        // 112.html
        val courseIdHTML = url.split("/")(2)
        // 112
        courseId = courseIdHTML.substring(0, courseIdHTML.lastIndexOf(".")).toInt
      }
      // load the cleaned fields into the ClickLog case class
      ClickLog(infos(0), DateUtils.parseToMinute(infos(1)), courseId, infos(3).toInt, infos(4))
      // the filter drops records whose url carries no course id,
      // e.g. "GET /course/list HTTP/1.1"
    }).filter(clicklog => clicklog.courseId != 0)
    // print() is the output operation that triggers execution of the DStream
    cleanData.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
The cleaned result: each batch now prints a stream of ClickLog(...) records.
Storing the results in HBase for ad-hoc queries
Lookup key: yyyyMMdd + courseid
1. Use a database to store the computed statistics
2. Spark Streaming writes the statistics into the database
3. The front end queries the database by yyyyMMdd + courseid and renders the results
4. Which data store to choose?
RDBMS: MySQL, Oracle…

day | course_id | click_count |
---|---|---|
20191205 | 1 | 10 |
20191205 | 2 | 20 |

When the next batch arrives:
20191205 + 1 ==> click_count + the next batch's count
This read-modify-write update is cumbersome in an RDBMS.
NoSQL: HBase, Redis…
The advantage of HBase: a single API call handles it, which is very convenient (see the sketch below).
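The code later in this post actually stores one row per cleaned record, but for the counting scenario in the table above, the "single API" would be HBase's atomic increment; a minimal sketch:
import org.apache.hadoop.hbase.client.Table
import org.apache.hadoop.hbase.util.Bytes

object ClickCounter {
  // atomically add a batch's clicks to the day+course counter cell;
  // returns the new running total
  def addClicks(table: Table, day: String, courseId: Int, count: Long): Long =
    table.incrementColumnValue(
      Bytes.toBytes(day + "_" + courseId), // rowkey: day + course id
      Bytes.toBytes("info"),               // column family
      Bytes.toBytes("click_count"),        // qualifier
      count)
}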
HBase table design
Create the course click-count table with one column family, info:
create 'hb_course_clickcount','info'
RowKey design:
day + courseid (a key like this easily creates region hotspots; in practice it should be salted — see the sketch below)
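A minimal salting sketch (a hypothetical helper, not used in this project): prefix the natural key with a hash-derived bucket so writes spread across regions.
object RowKeyUtil {
  // spread sequential day+course keys over `buckets` regions
  def salted(day: String, courseId: Int, buckets: Int = 10): String = {
    val natural = s"${day}_$courseId"
    val bucket = (natural.hashCode & Int.MaxValue) % buckets
    f"$bucket%02d_$natural"
  }
}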
HBaseUtil, Scala version
package project.utils

import org.apache.hadoop.hbase.client.{Put, Table}

object HBaseUtil {
  // write a single cell: rowkey -> columnFamily:columnName = value
  def insert(table: Table, rowkey: String, columnFamily: String, columnName: String, value: String): Unit = {
    val put = new Put(rowkey.getBytes())
    put.addColumn(columnFamily.getBytes(), columnName.getBytes(), value.getBytes())
    table.put(put)
  }
}
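One Put per column costs five round trips per record in the writer below; a single Put can carry all the cells at once. A multi-column variant (a sketch, not part of the original utility):
import org.apache.hadoop.hbase.client.{Put, Table}

object HBaseUtilBatch {
  // write several columns of one row with a single Put
  def insertRow(table: Table, rowkey: String, columnFamily: String, cells: Map[String, String]): Unit = {
    val put = new Put(rowkey.getBytes())
    cells.foreach { case (col, value) =>
      put.addColumn(columnFamily.getBytes(), col.getBytes(), value.getBytes())
    }
    table.put(put)
  }
}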
Writing the Spark Streaming results into HBase
package project.spark

import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import project.domain.ClickLog
import project.utils.{DateUtils, HBaseUtil}

object StreamingAppDemo {
  def main(args: Array[String]): Unit = {
    if (args.length != 4) {
      println("Usage: Streamdemo <zkQuorum> <group> <topics> <numThreads>")
      System.exit(1)
    }
    val Array(zkQuorum, groupId, topics, numThreads) = args
    /**
      * For local testing
      */
    val sparkConf = new SparkConf().setAppName("Streamdemo").setMaster("local[8]")
    val ssc = new StreamingContext(sparkConf, Seconds(60))
    val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
    /**
      * Receiver-based KafkaUtils: ZooKeeper tracks the offsets automatically
      */
    val messages = KafkaUtils.createStream(ssc, zkQuorum, groupId, topicMap)
    /**
      * Step 1: verify data is received
      */
    // messages.map(_._2).count().print()
    /**
      * Step 2: clean the data
      */
    val logs = messages.map(_._2)
    val cleanData = logs.map(line => {
      // e.g. "GET /class/112.html HTTP/1.1"
      val infos = line.split("\t")
      // e.g. /class/112.html HTTP/1.1
      val url = infos(2).split(" ")(1)
      var courseId = 0
      // extract the course id
      if (url.startsWith("/class")) {
        // 112.html
        val courseIdHTML = url.split("/")(2)
        // 112
        courseId = courseIdHTML.substring(0, courseIdHTML.lastIndexOf(".")).toInt
      }
      // load the cleaned fields into the ClickLog case class
      ClickLog(infos(0), DateUtils.parseToMinute(infos(1)), courseId, infos(3).toInt, infos(4))
      // the filter drops records whose url carries no course id,
      // e.g. "GET /course/list HTTP/1.1"
    }).filter(clicklog => clicklog.courseId != 0)
    cleanData.print()
    /**
      * Step 3: write each partition of each batch into HBase
      */
    cleanData.foreachRDD(rdd => {
      rdd.foreachPartition(partitionOfRecords => {
        // table name and HBase configuration
        val tableName = "hb_course_clickcount"
        val hbaseConf = HBaseConfiguration.create()
        hbaseConf.set("hbase.zookeeper.quorum", "172.20.0.207")
        hbaseConf.set("hbase.zookeeper.property.clientPort", "2181")
        val HBtable = TableName.valueOf(tableName)
        val conn = ConnectionFactory.createConnection(hbaseConf)
        val table = conn.getTable(HBtable)
        // insert the records, keyed by courseId + time
        partitionOfRecords.foreach(pair => {
          val rowkey = s"${pair.courseId}${pair.time}"
          HBaseUtil.insert(table, rowkey, "info", "courseId", pair.courseId.toString)
          HBaseUtil.insert(table, rowkey, "info", "ip", pair.ip)
          HBaseUtil.insert(table, rowkey, "info", "time", pair.time)
          HBaseUtil.insert(table, rowkey, "info", "status", pair.statusCode.toString)
          HBaseUtil.insert(table, rowkey, "info", "referer", pair.referer)
        })
        // release the connection opened for this partition
        table.close()
        conn.close()
      })
    })
    ssc.start()
    ssc.awaitTermination()
  }
}
Querying the raw HBase data with Phoenix
Overview:
Phoenix is an HBase-based OLTP technology for Hadoop, characterized by low latency, transaction support, SQL, and a JDBC interface. Phoenix also provides a secondary-index solution for HBase, which broadens the query patterns HBase can serve while retaining HBase's fast random access to massive data sets.
Phoenix is written entirely in Java and ships as an embedded JDBC driver for HBase. Its query engine compiles a SQL query into one or more HBase scans and orchestrates their execution to produce a standard JDBC result set. By using the HBase API, coprocessors, and custom filters directly, simple queries come back in milliseconds, and queries over millions of rows come back in seconds.
First, two things to be clear about:
1. A table created through Phoenix is visible in both HBase and Phoenix.
2. A native HBase table must be mapped through a Phoenix view before it can be queried.
Command to start Phoenix:
phoenix-sqlline 172.20.0.207:2181
Create a view mapping the HBase table hb_course_clickcount:
create view "hb_course_clickcount" ("ROW" varchar primary key, "info"."courseId" varchar, "info"."ip" varchar, "info"."referer" varchar, "info"."status" varchar, "info"."time" varchar);
Run an ad-hoc query against HBase:
select * from "hb_course_clickcount" limit 5;
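Aggregations work the same way; for example, clicks per course on one day (the LIKE pattern assumes the yyyyMMddHHmmss format produced by DateUtils):
select "courseId", count(*) as click_count from "hb_course_clickcount" where "time" like '20191205%' group by "courseId";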
After packaging with Maven and running it on the server, the real-time pipeline of log collection and cleaning is complete.
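A typical submit command might look like the following (a sketch: the jar path assumes a fat jar built with the Maven shade plugin, and the group name is illustrative):
spark-submit \
--master local[8] \
--class project.spark.StreamingAppDemo \
/usr/whl/streamingproject/streamingproject-1.0.jar \
172.20.0.207:2181 streaming-group streamingtopic 1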