Log Data Processing and Analysis with Flume + Kafka + Spark Streaming + HBase + Phoenix

Environment versions:

Hadoop: 3.0-CDH6.2.1

Spark: 2.4-CDH6.2.1

HBase: 2.1.0-CDH6.2.1

Phoenix: 5.0.0-cdh6.2.0.p0.1308267

Basic architecture:
(architecture diagram)
Why record user access behavior logs?

  • 1. Page views of the site's pages
  • 2. Site stickiness: repeated clicks and link clicks from users on the web or app client

Contents of a user behavior log entry:

  • Client type
  • Module / app ID
  • Link (jump) URL
  • Visitor IP address
  • Visitor account
  • Access time and region

Why analyze user behavior logs?

  • 1. The eyes of the site
  • 2. The nervous system of the site
  • 3. The brain of the site

Generating Web Access Data with Python

Developing a real-time log generator in Python

The Python script was written with Sublime Text.

#coding=UTF-8
import random
import time

url_paths = [
	"class/112.html",
	"class/128.html",
	"class/145.html",
	"class/146.html",
	"class/131.html",
	"class/130.html",
	"learn/821",
	"course/list"
]

ip_slices = [132,156,124,10,29,167,143,187,30,46,55,63,72,87,98]


http_referers = [
	"https://www.baidu.com/s?wd={query}",
	"https://www.sogou.com/web?query={query}",
	"https://cn.bing.com/search?q={query}",
	"https://search.yahoo.com/search?p={query}",
]

search_keyword = [
	"Spark SQL实战",
	"Hadoop基础",
	"Storm实战",
	"Spark Streaming实战",
	"大数据面试"
]

# HTTP status codes to sample from
status_codes = ["200","404","500"]

# sample a url
def sample_url():
    return random.sample(url_paths,1)[0]

# sample an ip: pick 4 slices from the list and join them
def sample_ip():
    slice = random.sample(ip_slices,4)
    return ".".join([str(item) for item in slice])

# sample a referer (about 80% of records get "-")
def sample_referer():
    if random.uniform(0,1) > 0.2:
        return "-"
    refer_str = random.sample(http_referers,1)
    query_str = random.sample(search_keyword,1)
    return refer_str[0].format(query=query_str[0])

# sample a status code
def sample_status_code():
    return random.sample(status_codes,1)[0]

# generate log lines
def generate_log(count = 10):
    time_str = time.strftime("%Y-%m-%d %H:%M:%S",time.localtime())
    # write to access.log on the server; "w+" truncates the file on every run
    f = open("/usr/whl/streamingproject/access.log","w+")
    while count>=1:
        query_log = "{ip}\t{localtime}\t\"GET /{url} HTTP/1.1\"\t{status}\t{referer}".format(url = sample_url(),ip=sample_ip(),referer=sample_referer(),status=sample_status_code(),localtime=time_str)
        print(query_log)
        f.write(query_log + "\n")
        count = count - 1
    f.close()

if __name__ == '__main__':
    generate_log(100)

Set up a scheduled task that runs once per minute

Linux crontab
	Expression tester: https://tool.lu/crontab
	Crontab expression: * */1 * * * (runs every minute)
	Run crontab -e and add the following entry:
	* */1 * * * /usr/whl/streamingproject/log.sh
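The log.sh script invoked by cron is not shown in the original; a minimal sketch, assuming the Python generator above is saved as generate_log.py in the same directory (the script name and python binary are assumptions):

#!/bin/bash
# Hypothetical wrapper that cron runs every minute; it regenerates access.log.
python /usr/whl/streamingproject/generate_log.py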

The generated log looks like this:

10.87.187.46	2019-12-04 16:46:01	"GET /class/130.html HTTP/1.1"	404	-
187.156.46.143	2019-12-04 16:46:01	"GET /course/list HTTP/1.1"	500	-
63.167.46.55	2019-12-04 16:46:01	"GET /class/145.html HTTP/1.1"	500	-
46.10.156.72	2019-12-04 16:46:01	"GET /course/list HTTP/1.1"	404	-
187.72.124.87	2019-12-04 16:46:01	"GET /class/131.html HTTP/1.1"	200	-
187.46.143.72	2019-12-04 16:46:01	"GET /class/146.html HTTP/1.1"	404	-
10.55.63.46	2019-12-04 16:46:01	"GET /class/112.html HTTP/1.1"	404	https://www.sogou.com/web?query=Spark SQL实战
63.132.55.187	2019-12-04 16:46:01	"GET /class/145.html HTTP/1.1"	200	-
46.63.156.72	2019-12-04 16:46:01	"GET /class/128.html HTTP/1.1"	404	https://www.sogou.com/web?query=Hadoop基础
156.72.167.63	2019-12-04 16:46:01	"GET /class/131.html HTTP/1.1"	500	-
55.63.72.46	2019-12-04 16:46:01	"GET /class/146.html HTTP/1.1"	404	-
55.29.98.46	2019-12-04 16:46:01	"GET /class/146.html HTTP/1.1"	200	-
87.167.124.29	2019-12-04 16:46:01	"GET /class/130.html HTTP/1.1"	404	

Real-time processing architecture

The Flume & Kafka & Spark Streaming pipeline

Using Flume
(Flume architecture diagram)

Flume is a highly available, highly reliable, distributed system from Cloudera for collecting, aggregating, and transporting large volumes of log data. Flume is built around a streaming architecture and is simple and flexible.

Application

Feed the logs produced by the Python log generator into Flume.

Component choices (access.log ==> console output):

  • source: exec
  • channel: memory
  • sink: logger


The agent configuration is as follows
File name: streaming.conf
exec-memory-logger.sources = exec-source
exec-memory-logger.sinks = logger-sink
exec-memory-logger.channels = memory-channel
exec-memory-logger.sources.exec-source.type = exec
exec-memory-logger.sources.exec-source.command = tail -f /usr/whl/streamingproject/access.log
exec-memory-logger.sources.exec-source.shell = /bin/sh -c
exec-memory-logger.channels.memory-channel.type = memory
exec-memory-logger.sinks.logger-sink.type = logger
exec-memory-logger.sources.exec-source.channels = memory-channel
exec-memory-logger.sinks.logger-sink.channel = memory-channel


Command to start the agent:
flume-ng agent \
--name exec-memory-logger \
--conf $FLUME_HOME/conf \
--conf-file /usr/whl/streamingproject/streaming.conf \
-Dflume.root.logger=INFO,console

If output similar to the following is printed:
19/12/04 17:16:05 INFO sink.LoggerSink: Event: { headers:{} body: 32 39 2E 31 35 36 2E 31 32 34 2E 36 33 09 32 30 29.156.124.63.20 }
19/12/04 17:16:05 INFO sink.LoggerSink: Event: { headers:{} body: 31 32 34 2E 35 35 2E 31 30 2E 31 35 36 09 32 30 124.55.10.156.20 }
19/12/04 17:16:05 INFO sink.LoggerSink: Event: { headers:{} body: 31 30 2E 31 33 32 2E 37 32 2E 34 36 09 32 30 31 10.132.72.46.201 }
19/12/04 17:16:05 INFO sink.LoggerSink: Event: { headers:{} body: 31 30 2E 36 33 2E 31 33 32 2E 31 32 34 09 32 30 10.63.132.124.20 }
19/12/04 17:16:05 INFO sink.LoggerSink: Event: { headers:{} body: 35 35 2E 31 33 32 2E 38 37 2E 31 32 34 09 32 30 55.132.87.124.20 }
19/12/04 17:16:05 INFO sink.LoggerSink: Event: { headers:{} body: 31 32 34 2E 31 30 2E 33 30 2E 35 35 09 32 30 31 124.10.30.55.201 }
then the console sink is receiving the access logs as they are generated in real time.
Using Kafka

Overview

Kafka is a distributed publish/subscribe messaging system and stream-processing platform.

1. Publish and subscribe to streams of records

2. Store streams of records with good fault tolerance

3. Process streams of records as they are produced (Kafka Streams)

Use cases

1. Building real-time streaming data pipelines that reliably move data between systems and applications

2. Building real-time streaming applications

Start the Kafka server in the background from
/opt/cloudera/parcels/KAFKA-4.1.0-1.4.1.0.p0.4/bin:
./kafka-server-start -daemon $KAFKA_HOME/config/server.properties
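If the topic used below does not exist yet, it should be created first; a sketch assuming ZooKeeper at 172.20.0.207:2181 and a single-broker setup (the replication factor and partition count are assumptions):

kafka-topics --create --zookeeper 172.20.0.207:2181 --replication-factor 1 --partitions 1 --topic streamingtopic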

Modify the Flume configuration to write to Kafka

streaming_kafka.conf

exec-memory-kafka.sources = exec-source
exec-memory-kafka.sinks = kafka-sink
exec-memory-kafka.channels = memory-channel
exec-memory-kafka.channels.memory-channel.byteCapacity = 800000
exec-memory-kafka.sources.exec-source.type = exec
exec-memory-kafka.sources.exec-source.command = tail -f /usr/whl/streamingproject/access.log
exec-memory-kafka.sources.exec-source.shell = /bin/sh -c
exec-memory-kafka.channels.memory-channel.type = memory
exec-memory-kafka.sinks.kafka-sink.type = org.apache.flume.sink.kafka.KafkaSink
exec-memory-kafka.sinks.kafka-sink.brokerList = 172.20.0.207:9092
exec-memory-kafka.sinks.kafka-sink.topic = streamingtopic
exec-memory-kafka.sinks.kafka-sink.batchSize = 5
exec-memory-kafka.sinks.kafka-sink.requiredAcks = 1
exec-memory-kafka.sources.exec-source.channels = memory-channel


First, start a Kafka console consumer:
kafka-console-consumer --bootstrap-server 172.20.0.207:9092  --topic streamingtopic



Then start the Flume agent:

flume-ng agent \
--name exec-memory-kafka \
--conf $FLUME_HOME/conf \
--conf-file /usr/whl/streamingproject/streaming_kafka.conf \
-Dflume.root.logger=INFO,console


For reference, the command to reset a Kafka Streams application's committed offsets on its input topics (the hosts and application id below come from a different environment):
bin/kafka-streams-application-reset.sh --zookeeper sandbox-hdp.hortonworks.com:2181 --bootstrap-servers sandbox-hdp.hortonworks.com:6667 --application-id it21learning-event-attendees-streamming --input-topics event_attendees_raw


Problem encountered:
org.apache.flume.ChannelFullException: Space for commit to queue couldn't be acquired. Sinks are likely not keeping up with sources, or the buffer size is too tight

The fix is to increase the memory channel's byteCapacity, e.g.:
exec-memory-kafka.channels.memory-channel.byteCapacity = 800000
For details see:
https://blog.csdn.net/gaopu12345/article/details/77922924
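Related memory channel properties are often tuned together; a sketch with illustrative values (only byteCapacity appears in the original configuration):

exec-memory-kafka.channels.memory-channel.capacity = 10000
exec-memory-kafka.channels.memory-channel.transactionCapacity = 1000
exec-memory-kafka.channels.memory-channel.byteCapacity = 800000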
The Spark Application

IDEA project, Scala version 2.11.8

pom.xml dependencies are listed below.

The Spark artifacts must match the project's Scala version (here the _2.11 artifacts); otherwise the project will not run.

  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
      <scope>test</scope>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>2.4.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.11</artifactId>
      <version>2.4.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.11</artifactId>
      <version>2.4.0</version>
      <scope>compile</scope>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-8 -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
      <version>2.4.0</version>
    </dependency>

  </dependencies>
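Optionally, the Scala and Spark versions can be centralized in a properties block so all four Spark artifacts stay consistent; a sketch (the property names are a common convention, not from the original pom):

  <properties>
    <scala.binary.version>2.11</scala.binary.version>
    <spark.version>2.4.0</spark.version>
  </properties>

Each dependency then becomes, for example, spark-core_${scala.binary.version} with <version>${spark.version}</version>.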

Spark Streaming can consume data from Kafka in two ways

  • 1. Receiver-based approach

This approach uses a receiver to pull data from Kafka through the Kafka high-level consumer API. For all receivers, the data fetched from Kafka is stored in Spark executors, and the jobs launched by Spark Streaming then process that data.

  • 2. Direct approach

Introduced in Spark 1.3, the Direct approach has no receiver layer. It periodically queries Kafka for the latest offsets of every partition of every topic and processes each batch according to the configured maxRatePerPartition.

This walkthrough uses the first approach (the high-level API); a sketch of the Direct alternative is included below for contrast.
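A minimal sketch of the Direct approach against the same spark-streaming-kafka-0-8 dependency, not used in this project; the broker address and topic follow this environment, everything else is illustrative:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DirectStreamSketch {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("DirectStreamSketch").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(60))
    // The Direct API talks to the brokers directly and tracks offsets itself,
    // instead of relying on a receiver plus ZooKeeper.
    val kafkaParams = Map[String, String]("metadata.broker.list" -> "172.20.0.207:9092")
    val topics = Set("streamingtopic")
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)
    messages.map(_._2).count().print()
    ssc.start()
    ssc.awaitTermination()
  }
}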

Add the Scala SDK to the project (select Scala in IDEA).
Package structure: (screenshot omitted)

package project

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object StreamingAppDemo {
  def main(args: Array[String]): Unit = {
    if (args.length!=4){
      println("Usage:Streamdemo <zkQuorum> <group> <topics> <numThreads>")
      System.exit(1)
    }
    val Array(zkQuorum,groupId,topics,numThreads) = args

    /**
      * For local testing
      */
    val sparkConf = new SparkConf().setAppName("Streamdemo").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf,Seconds(60))
    val topicMap = topics.split(",").map((_,numThreads.toInt)).toMap
    /**
      * Receiver-based KafkaUtils API; offsets are recorded in ZooKeeper automatically, no manual reset needed
      */

    val messages = KafkaUtils.createStream(ssc,zkQuorum,groupId,topicMap)

    /**
      * Test step 1: verify data reception (simple record count)
      */
    messages.map(_._2).count().print()
    ssc.start()
    ssc.awaitTermination()
  }
}

Run configuration (program arguments): (screenshots omitted)
If each batch prints a record count, the job is consuming data successfully.
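The four program arguments correspond to <zkQuorum> <group> <topics> <numThreads>; a hypothetical example matching this environment (the consumer group name is an assumption):

172.20.0.207:2181 streaming_group streamingtopic 1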

Data cleansing with Spark Streaming

Data cleansing: extract the fields we need from the raw log lines.

Package structure: (screenshot omitted)

1. Date-parsing utility class

package project.utils

import java.util.Date

import org.apache.commons.lang3.time.FastDateFormat

/**
  * Date/time utility class
  */
object DateUtils {
  // source format
  val YYYYMMDDHHMMSS_FORMAT = FastDateFormat.getInstance("yyyy-MM-dd HH:mm:ss")
  // target format
  val TARGET_FORMAT = FastDateFormat.getInstance("yyyyMMddHHmmss")
  // parse the source timestamp into epoch milliseconds
  def getTime(time:String) = {
    YYYYMMDDHHMMSS_FORMAT.parse(time).getTime
  }
  // convert a timestamp string into the target format
  def parseToMinute(time:String) = {
    TARGET_FORMAT.format(new Date(getTime(time)))
  }

  def main(args: Array[String]): Unit = {
    // test the conversion
//    println(parseToMinute("2019-12-05 17:03:01"))
  }
}

2. Case class (entity class) ClickLog

package project.domain

/**
  * A cleaned log record, defined as a case class
  * @param ip visitor IP address
  * @param time access time
  * @param courseId id of the accessed course
  * @param statusCode HTTP status code
  * @param referer request referer
  */
case class ClickLog(ip:String,time:String,courseId:Int,statusCode:Int,referer:String) {

}

3. StreamingAppDemo
Before running this class, make sure the Flume agent is running and collecting data into Kafka.

package project.spark

import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import project.domain.ClickLog
import project.utils.DateUtils

object StreamingAppDemo {
  def main(args: Array[String]): Unit = {
    if (args.length!=4){
      println("Usage:Streamdemo <zkQuorum> <group> <topics> <numThreads>")
      System.exit(1)
    }
    val Array(zkQuorum,groupId,topics,numThreads) = args

    /**
      * For local testing
      */
    val sparkConf = new SparkConf().setAppName("Streamdemo").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf,Seconds(60))
    val topicMap = topics.split(",").map((_,numThreads.toInt)).toMap
    /**
      * Receiver-based KafkaUtils API; offsets are recorded in ZooKeeper automatically, no manual reset needed
      */

    val messages = KafkaUtils.createStream(ssc,zkQuorum,groupId,topicMap)

    /**
      * Test step 1: verify data reception
      */
//    messages.map(_._2).count().print()
    /**
      * Test step 2: data cleansing
      */
    val logs = messages.map(_._2)
    val cleanData = logs.map(line =>{
      //"GET /class/112.html HTTP/1.1"
      val infos = line.split("\t")
      ///class/112.html HTTP/1.1
      val url = infos(2).split(" ")(1)
      var courseId = 0
      // extract the course id
      if(url.startsWith("/class")){
        //112.html
        val courseIdHTML = url.split("/")(2)
        //112
        courseId = courseIdHTML.substring(0,courseIdHTML.lastIndexOf(".")).toInt
      }
      // load the cleaned fields into the ClickLog case class
      ClickLog(infos(0),DateUtils.parseToMinute(infos(1)),courseId,infos(3).toInt,infos(4))
      // the filter drops records without a valid course id,
      // e.g. "GET /course/list HTTP/1.1"
    }).filter(clicklog =>clicklog.courseId!=0)
    // an output operation (print) is needed to trigger execution of the DStream
    cleanData.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

Cleansing result: (screenshot omitted)

Storing the results in HBase for ad-hoc queries

Key format: yyyyMMdd + courseid

1. Use a database to store our statistics

2. Spark Streaming writes the statistics into the database

3. The visualization front end reads the statistics from the database keyed by yyyyMMdd and courseid and displays them

4. Which data store to choose?

RDBMS: MySQL, Oracle, ...

day        course_id   click_count
20191205   1           10
20191205   2           20

When the next batch arrives:

	the row for day 20191205 and course 1 ==> click_count + the count from the next batch

	In an RDBMS this read-then-update pattern is cumbersome.

NoSQL: HBase, Redis, ...

Advantage of HBase: a single API call handles the update, which is very convenient (see the sketch below).
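A minimal sketch of the HBase increment API that makes this a one-call update; the table and column family follow this project, while the counter column name, rowkey format, and amount are assumptions:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.util.Bytes

object IncrementSketch {
  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", "172.20.0.207")
    conf.set("hbase.zookeeper.property.clientPort", "2181")
    val conn = ConnectionFactory.createConnection(conf)
    val table = conn.getTable(TableName.valueOf("hb_course_clickcount"))
    // Atomically add this batch's count to the cell; no read-modify-write needed.
    table.incrementColumnValue(
      Bytes.toBytes("20191205_1"),   // rowkey: day_courseid (hypothetical format)
      Bytes.toBytes("info"),         // column family
      Bytes.toBytes("click_count"),  // counter column (assumed name)
      10L)                           // amount to add for this batch
    table.close()
    conn.close()
  }
}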

HBase table design

Create the course table with column family info:

create 'hb_course_clickcount','info'

RowKey design:

day + courseid (this layout easily causes region hotspotting; in practice the rowkey should be salted, as sketched below)
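A minimal salting sketch, assuming the rowkey is prefixed with a hash-derived bucket id; the bucket count of 10 and the key layout are arbitrary choices for illustration:

object RowkeySalt {
  // Prefix the natural key with a hash-derived bucket id so sequential
  // day+courseid rowkeys are spread across regions.
  def saltedRowkey(day: String, courseId: Int, buckets: Int = 10): String = {
    val natural = s"$day$courseId"
    val salt = Math.abs(natural.hashCode) % buckets
    f"$salt%02d_$natural"
  }

  def main(args: Array[String]): Unit = {
    // prints something like "NN_20191205112", where NN is a bucket id 00..09
    println(saltedRowkey("20191205", 112))
  }
}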

Scala HBaseUtil

package project.utils


import org.apache.hadoop.hbase.client.{Put, Table}

object HBaseUtil {
  // write a single cell: rowkey / column family / column name / value
  def insert(table:Table,rowkey:String,columnFamily:String,columnName:String,value:String): Unit ={
    val put = new Put(rowkey.getBytes())
    put.addColumn(columnFamily.getBytes(),columnName.getBytes(),value.getBytes())
    table.put(put)
  }
}

Writing the Spark Streaming results to HBase

package project.spark


import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import project.domain.ClickLog
import project.utils.{DateUtils, HBaseUtil}


object StreamingAppDemo {
  def main(args: Array[String]): Unit = {
    if (args.length!=4){
      println("Usage:Streamdemo <zkQuorum> <group> <topics> <numThreads>")
      System.exit(1)
    }
    val Array(zkQuorum,groupId,topics,numThreads) = args

    /**
      * For local testing
      */
    val sparkConf = new SparkConf().setAppName("Streamdemo").setMaster("local[8]")
    val ssc = new StreamingContext(sparkConf,Seconds(60))
    val topicMap = topics.split(",").map((_,numThreads.toInt)).toMap
    /**
      * Receiver-based KafkaUtils API; offsets are recorded in ZooKeeper automatically, no manual reset needed
      */

    val messages = KafkaUtils.createStream(ssc,zkQuorum,groupId,topicMap)

    /**
      * Test step 1: verify data reception
      */
//    messages.map(_._2).count().print()
    /**
      * Test step 2: data cleansing
      */
    val logs = messages.map(_._2)
    val cleanData = logs.map(line =>{
      //"GET /class/112.html HTTP/1.1"
      val infos = line.split("\t")
      ///class/112.html HTTP/1.1
      val url = infos(2).split(" ")(1)
      var courseId = 0
      // extract the course id
      if(url.startsWith("/class")){
        //112.html
        val courseIdHTML = url.split("/")(2)
        //112
        courseId = courseIdHTML.substring(0,courseIdHTML.lastIndexOf(".")).toInt
      }
      // load the cleaned fields into the ClickLog case class
      ClickLog(infos(0),DateUtils.parseToMinute(infos(1)),courseId,infos(3).toInt,infos(4))
      // the filter drops records without a valid course id,
      // e.g. "GET /course/list HTTP/1.1"
    }).filter(clicklog =>clicklog.courseId!=0)
    // an output operation is needed to trigger execution of the DStream

    cleanData.print()
    cleanData.foreachRDD(rdd => {
      rdd.foreachPartition(partitionOfRecords => {
        // table name and HBase configuration (one connection per partition)
        val tableName = "hb_course_clickcount"
        val hbaseConf = HBaseConfiguration.create()
        hbaseConf.set("hbase.zookeeper.quorum", "172.20.0.207")
        hbaseConf.set("hbase.zookeeper.property.clientPort", "2181")
        val HBtable = TableName.valueOf(tableName)
        val conn = ConnectionFactory.createConnection(hbaseConf)
        val table = conn.getTable(HBtable)
        // insert the cleaned records
        partitionOfRecords.foreach(pair=>{
          HBaseUtil.insert(table,(pair.courseId+pair.time),"info","courseId",pair.courseId.toString)
          HBaseUtil.insert(table,(pair.courseId+pair.time),"info","ip",pair.ip.toString)
          HBaseUtil.insert(table,(pair.courseId+pair.time),"info","time",pair.time.toString)
          HBaseUtil.insert(table,(pair.courseId+pair.time),"info","status",pair.statusCode.toString)
          HBaseUtil.insert(table,(pair.courseId+pair.time),"info","referer",pair.referer.toString)
        })
        // release HBase resources for this partition
        table.close()
        conn.close()
      })
    })
    ssc.start()
    ssc.awaitTermination()
  }

}

Querying the raw HBase data with Phoenix

Overview:

Phoenix is an OLTP layer for HBase built on Hadoop; it offers low latency, transactions, SQL support, and a JDBC interface. Phoenix also provides a secondary-index solution for HBase, broadening the kinds of queries HBase can serve while retaining HBase's fast random access over massive data sets.

Phoenix is written entirely in Java and ships as an embedded JDBC driver for HBase. The Phoenix query engine compiles a SQL query into one or more HBase scans and orchestrates their execution to produce a standard JDBC result set. By working directly with the HBase API, coprocessors, and custom filters, simple queries run in milliseconds, and queries over millions of rows run in seconds.

Two points to be clear about first:

1. Tables created through Phoenix are visible in both HBase and Phoenix

2. Native HBase tables must be mapped by a Phoenix view before they can be queried

Start Phoenix with:

phoenix-sqlline 172.20.0.207:2181

Create a view that maps the HBase table "hb_course_clickcount":

create view "hb_course_clickcount" ("ROW" varchar primary key,"info"."courseId" varchar,"info"."ip" varchar,"info"."referer" varchar,"info"."status" varchar,"info"."time" varchar)as select * from "hb_course_clickcount";

Run ad-hoc queries against HBase:

select * from "hb_course_clickcount" limit 5;
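Grouped queries work the same way; for example, counting rows per course over the raw data (the quoted column names are the ones defined in the view above):

select "courseId", count(*) from "hb_course_clickcount" group by "courseId";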

(query result screenshot omitted)

After packaging with Maven and running on the server, the real-time log collection and cleansing pipeline is complete.
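A hedged example of submitting the packaged jar with spark-submit; the jar name, master, and consumer group are assumptions, the Kafka and HBase client dependencies must be on the classpath (e.g. via an assembly/fat jar), and the hard-coded setMaster("local[...]") in the code is meant for local testing and would need to be removed for a cluster submit:

spark-submit \
  --class project.spark.StreamingAppDemo \
  --master yarn \
  --deploy-mode client \
  /usr/whl/streamingproject/streamingproject-1.0.jar \
  172.20.0.207:2181 streaming_group streamingtopic 1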
