User Behavior Data Collection, Section 7: Kafka Stress Testing and Consuming Kafka Data with Flume

Previous: User Behavior Data Collection, Section 6: Data Warehouse Collection, Kafka Manager Installation and Script Testing


1. Project experience: Kafka stress testing

  1. Kafka stress test
    Use the scripts that ship with Kafka to stress-test it. While the test runs you can see where the bottleneck appears (CPU, memory, or network I/O). In most cases network I/O saturates first.
    kafka-consumer-perf-test.sh
    kafka-producer-perf-test.sh
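
    (Optional) To confirm which resource actually hits its limit, watch the broker from a second terminal while the test runs. A minimal sketch, assuming the sysstat package is installed; any monitoring tool you already use works just as well:

// per-second network throughput for each interface
[root@hadoop1 ~]# sar -n DEV 1
// CPU and memory usage
[root@hadoop1 ~]# top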

  2. Kafka producer stress test
    (1) Both scripts are under module/kafka/bin. Let's run a test:

[root@hadoop1 kafka]# bin/kafka-producer-perf-test.sh  --topic test --record-size 100 --num-records 100000 --throughput 1000 --producer-props bootstrap.servers=hadoop1.x:9092,hadoop2.x:9092,hadoop3.x:9092

// throughput and send latency output
[2020-04-08 20:30:59,529] WARN Error while fetching metadata with correlation id 3 : {test=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
5000 records sent, 1000.0 records/sec (0.10 MB/sec), 9.1 ms avg latency, 439.0 max latency.
4987 records sent, 996.8 records/sec (0.10 MB/sec), 6.4 ms avg latency, 23.0 max latency.
5020 records sent, 1004.0 records/sec (0.10 MB/sec), 4.2 ms avg latency, 23.0 max latency.
5000 records sent, 1000.0 records/sec (0.10 MB/sec), 4.0 ms avg latency, 39.0 max latency.
4991 records sent, 998.0 records/sec (0.10 MB/sec), 3.7 ms avg latency, 26.0 max latency.
5011 records sent, 1002.0 records/sec (0.10 MB/sec), 1.2 ms avg latency, 22.0 max latency.
5000 records sent, 1000.0 records/sec (0.10 MB/sec), 1.9 ms avg latency, 18.0 max latency.
5003 records sent, 1000.2 records/sec (0.10 MB/sec), 1.0 ms avg latency, 13.0 max latency.
5002 records sent, 1000.0 records/sec (0.10 MB/sec), 1.0 ms avg latency, 16.0 max latency.
5001 records sent, 1000.2 records/sec (0.10 MB/sec), 1.0 ms avg latency, 12.0 max latency.
5002 records sent, 1000.0 records/sec (0.10 MB/sec), 2.0 ms avg latency, 30.0 max latency.
5001 records sent, 1000.2 records/sec (0.10 MB/sec), 2.1 ms avg latency, 49.0 max latency.
5002 records sent, 1000.0 records/sec (0.10 MB/sec), 0.9 ms avg latency, 12.0 max latency.
5001 records sent, 1000.2 records/sec (0.10 MB/sec), 1.0 ms avg latency, 15.0 max latency.
5000 records sent, 1000.0 records/sec (0.10 MB/sec), 1.3 ms avg latency, 22.0 max latency.
5001 records sent, 1000.2 records/sec (0.10 MB/sec), 0.9 ms avg latency, 10.0 max latency.
5001 records sent, 1000.0 records/sec (0.10 MB/sec), 0.9 ms avg latency, 7.0 max latency.
5002 records sent, 1000.2 records/sec (0.10 MB/sec), 1.1 ms avg latency, 12.0 max latency.
4999 records sent, 999.8 records/sec (0.10 MB/sec), 1.0 ms avg latency, 14.0 max latency.
100000 records sent, 999.840026 records/sec (0.10 MB/sec), 2.32 ms avg latency, 439.00 ms max latency, 1 ms 50th, 11 ms 95th, 18 ms 99th, 71 ms 99.9th.
[root@hadoop1 kafka]# 

Notes: --record-size is the size of one message, in bytes.
--num-records is the total number of messages to send.
--throughput is the target number of messages per second (the throttle).

Kafka prints a summary like the last line of the output above.
Reading the result:
In this run 100,000 messages were written in total at about 0.10 MB/s (roughly 1,000 messages per second), with an average write latency of 2.32 ms and a maximum latency of 439 ms.
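
The machine-count formula in section 2 needs the peak write speed of a single broker. For that measurement you would normally remove the throttle; a minimal sketch, reusing the same test topic (in kafka-producer-perf-test.sh, --throughput -1 disables throttling):

[root@hadoop1 kafka]# bin/kafka-producer-perf-test.sh --topic test --record-size 100 --num-records 1000000 --throughput -1 --producer-props bootstrap.servers=hadoop1.x:9092,hadoop2.x:9092,hadoop3.x:9092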

  3. Kafka consumer stress test
    For the consumer test: if none of the four resources (I/O, CPU, memory, network) can be changed, consider increasing the number of partitions to improve performance (see the example after the parameter descriptions below).
[root@hadoop1 kafka]# bin/kafka-consumer-perf-test.sh --zookeeper hadoop1.x:2181 --topic test --fetch-size 10000 --messages 10000000 --threads 1

// consumer test output
start.time, end.time, data.consumed.in.MB, MB.sec, data.consumed.in.nMsg, nMsg.sec
2020-04-09 14:02:24:573, 2020-04-09 14:02:32:454, 9.5367, 1.2101, 100000, 12688.7451

Parameter descriptions:

--zookeeper   ZooKeeper connection string
--topic       topic name
--fetch-size  size of each fetch request
--messages    total number of messages to consume

Reading the result:
start.time, end.time, data.consumed.in.MB, MB.sec, data.consumed.in.nMsg, nMsg.sec
2020-04-09 14:02:24:573, 2020-04-09 14:02:32:454, 9.5367, 1.2101, 100000, 12688.7451
Test start time, test end time, total data consumed 9.5367 MB, throughput 1.2101 MB/s, total messages consumed 100,000, average consumption rate 12688.7451 messages per second.
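
If you do add partitions as suggested above, kafka-topics.sh can do it. A minimal sketch, assuming the same ZooKeeper address and the test topic, and that 6 is the partition count you want (the count can only be increased, never decreased):

[root@hadoop1 kafka]# bin/kafka-topics.sh --zookeeper hadoop1.x:2181 --alter --topic test --partitions 6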


2. Project experience: estimating the number of Kafka machines

Number of Kafka machines (rule of thumb) = 2 * (peak production speed * replication factor / 100) + 1
First estimate roughly how much data is produced per day, then use Kafka's built-in producer stress test (which only measures Kafka's write speed, to make sure data does not pile up) to find the peak production speed. Combined with the chosen replication factor, you can then estimate how many Kafka machines to deploy.

For example, suppose the stress test shows a write speed of 10 MB/s per machine, the peak business data rate is 50 MB/s, and the replication factor is 2.
Number of Kafka machines = 2 * (50 * 2 / 100) + 1 = 3
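
To plug your own numbers into the rule of thumb, a minimal sketch in bash (PEAK_MBS and REPLICAS are just illustrative variable names; bash integer division is fine for a rough estimate):

#!/bin/bash
PEAK_MBS=50     # peak production speed in MB/s, from the unthrottled producer test
REPLICAS=2      # replication factor
MACHINES=$(( 2 * (PEAK_MBS * REPLICAS / 100) + 1 ))
echo "Estimated Kafka machines: $MACHINES"    # prints 3 for these inputs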


3. Consuming Kafka data with Flume

Architecture diagram: (figure not reproduced here)
Cluster plan:

                       | Server hadoop1.x | Server hadoop2.x | Server hadoop3.x
Flume (consumes Kafka) |                  |                  | Flume

Configuration of the log-consuming Flume

  1. Flume configuration analysis
    (Figure: Flume configuration analysis diagram, not reproduced here)

  2. The concrete Flume configuration is as follows:
    (1) On hadoop1.x, create the file kafka-flume-hdfs.conf in the module/flume-1.7.0/conf directory

[root@hadoop1 conf]# vim kafka-flume-hdfs.conf

// put the following content in the file
## Components
a1.sources=r1 r2
a1.channels=c1 c2
a1.sinks=k1 k2

## source1
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.batchSize = 5000
a1.sources.r1.batchDurationMillis = 2000
a1.sources.r1.kafka.bootstrap.servers = hadoop1.x:9092,hadoop2.x:9092,hadoop3.x:9092
a1.sources.r1.kafka.topics=topic_start

## source2
a1.sources.r2.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r2.batchSize = 5000
a1.sources.r2.batchDurationMillis = 2000
a1.sources.r2.kafka.bootstrap.servers = hadoop1.x:9092,hadoop2.x:9092,hadoop3.x:9092
a1.sources.r2.kafka.topics=topic_event

## channel1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /usr/local/etc/hadoop/module/flume-1.7.0/checkpoint/behavior1
a1.channels.c1.dataDirs = /usr/local/etc/hadoop/module/flume-1.7.0/data/behavior1/
a1.channels.c1.maxFileSize = 2146435071
a1.channels.c1.capacity = 1000000
a1.channels.c1.keep-alive = 6

## channel2
a1.channels.c2.type = file
a1.channels.c2.checkpointDir = /usr/local/etc/hadoop/module/flume-1.7.0/checkpoint/behavior2
a1.channels.c2.dataDirs = /usr/local/etc/hadoop/module/flume-1.7.0/data/behavior2/
a1.channels.c2.maxFileSize = 2146435071
a1.channels.c2.capacity = 1000000
a1.channels.c2.keep-alive = 6

## sink1
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /origin_data/gmall/log/topic_start/%Y-%m-%d
a1.sinks.k1.hdfs.filePrefix = logstart-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = second

##sink2
a1.sinks.k2.type = hdfs
a1.sinks.k2.hdfs.path = /origin_data/gmall/log/topic_event/%Y-%m-%d
a1.sinks.k2.hdfs.filePrefix = logevent-
a1.sinks.k2.hdfs.round = true
a1.sinks.k2.hdfs.roundValue = 10
a1.sinks.k2.hdfs.roundUnit = second

## Avoid producing lots of small files
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0

a1.sinks.k2.hdfs.rollInterval = 10
a1.sinks.k2.hdfs.rollSize = 134217728
a1.sinks.k2.hdfs.rollCount = 0

## Output files are compressed streams (LZO)
a1.sinks.k1.hdfs.fileType = CompressedStream 
a1.sinks.k2.hdfs.fileType = CompressedStream 

a1.sinks.k1.hdfs.codeC = lzop
a1.sinks.k2.hdfs.codeC = lzop

## Wiring: bind sources and sinks to their channels
a1.sources.r1.channels = c1
a1.sinks.k1.channel= c1

a1.sources.r2.channels = c2
a1.sinks.k2.channel= c2
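
The hdfs.path values above use the %Y-%m-%d time escape, which the HDFS sink resolves from a timestamp header on each event. If the events that reach the sink carry no timestamp header, path resolution will fail; a common safeguard, shown here only as a sketch and not part of the original configuration, is to let the sink fall back to the local clock:

## Optional safeguard (not in the original config): use local time for %Y-%m-%d
## when events have no timestamp header
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k2.hdfs.useLocalTimeStamp = true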

Configuration property diagram: (figure not reproduced here)

  3. Write a start/stop script for this consumer Flume
    (1) Create the script f2.sh in the /home/MrZhou/bin directory

[root@hadoop1 bin]# vim f2.sh

// put the following content in the script
#! /bin/bash

case $1 in
"start"){
        for i in hadoop3.x
        do
                echo " -------- starting consumer Flume on $i -------"
                ssh $i "nohup /usr/local/etc/hadoop/module/flume-1.7.0/bin/flume-ng agent --conf-file /usr/local/etc/hadoop/module/flume-1.7.0/conf/kafka-flume-hdfs.conf  --name a1 -Dflume.root.logger=INFO,LOGFILE >/usr/local/etc/hadoop/module/flume-1.7.0/log.txt   2>&1 &"
        done
};;
"stop"){
        for i in  hadoop3.x
        do
                echo " -------- stopping consumer Flume on $i -------"
                ssh $i "ps -ef | grep kafka-flume-hdfs | grep -v grep |awk '{print \$2}' | xargs kill"
        done

};;
esac

(2) 增加脚本执行权限

[root@hadoop1 bin]# chmod 777 f2.sh

(3) 查看当前进程

[root@hadoop1 bin]# ./xcall.sh jps
--------- hadoop1.x ----------
12514 ResourceManager
12195 DataNode
12933 Jps
10150 QuorumPeerMain
12630 NodeManager
10443 Kafka
12364 SecondaryNameNode
12063 NameNode
--------- hadoop2.x ----------
7728 QuorumPeerMain
8460 Jps
8013 Kafka
--------- hadoop3.x ----------
7987 Kafka
7702 QuorumPeerMain
8441 Jps

(4) Start the cluster with the f2 script

[root@hadoop1 bin]# ./f2.sh start
 -------- starting consumer Flume on hadoop3.x -------

To check whether it actually started, look at the log on hadoop3.x:

[root@hadoop3 flume-1.7.0]# ll
total 144
drwxr-xr-x.  2 root root    62 Apr  6 00:37 bin
-rw-r--r--.  1 root root 77387 Apr  6 00:37 CHANGELOG
drwxr-xr-x.  2 root root   118 Apr  7 23:05 conf
-rw-r--r--.  1 root root  6172 Apr  6 00:37 DEVNOTES
-rw-r--r--.  1 root root  2873 Apr  6 00:37 doap_Flume.rdf
drwxr-xr-x. 10 root root  4096 Apr  6 00:37 docs
drwxr-xr-x.  2 root root  4096 Apr  6 00:37 lib
-rw-r--r--.  1 root root 27625 Apr  6 00:37 LICENSE
-rw-r--r--.  1 root root  2079 Apr  9 11:12 log.txt
-rw-r--r--.  1 root root   249 Apr  6 00:37 NOTICE
-rw-r--r--.  1 root root  2520 Apr  6 00:37 README.md
-rw-r--r--.  1 root root  1585 Apr  6 00:37 RELEASE-NOTES
drwxr-xr-x.  2 root root    68 Apr  6 00:37 tools

// tail the log
[root@hadoop3 flume-1.7.0]# tail -f log.txt 

// the output is as follows:
Info: Including Hadoop libraries found via (/usr/local/etc/hadoop/module/hadoop-2.7.2/bin/hadoop) for HDFS access
Info: Including Hive libraries found via () for Hive access
+ exec /usr/local/etc/java/moduce/jdk1.8.0_221/bin/java -Xmx20m -Dflume.root.logger=INFO,LOGFILE -cp '/usr/local/etc/hadoop/module/flume-1.7.0/lib/*:/usr/local/etc/hadoop/module/hadoop-2.7.2/etc/hadoop:/usr/local/etc/hadoop/module/hadoop-2.7.2/share/hadoop/common/lib/*:/usr/local/etc/hadoop/module/hadoop-2.7.2/share/hadoop/common/*:/usr/local/etc/hadoop/module/hadoop-2.7.2/share/hadoop/hdfs:/usr/local/etc/hadoop/module/hadoop-2.7.2/share/hadoop/hdfs/lib/*:/usr/local/etc/hadoop/module/hadoop-2.7.2/share/hadoop/hdfs/*:/usr/local/etc/hadoop/module/hadoop-2.7.2/share/hadoop/yarn/lib/*:/usr/local/etc/hadoop/module/hadoop-2.7.2/share/hadoop/yarn/*:/usr/local/etc/hadoop/module/hadoop-2.7.2/share/hadoop/mapreduce/lib/*:/usr/local/etc/hadoop/module/hadoop-2.7.2/share/hadoop/mapreduce/*:/usr/local/etc/hadoop/module/hadoop-2.7.2/contrib/capacity-scheduler/*.jar:/lib/*' -Djava.library.path=:/usr/local/etc/hadoop/module/hadoop-2.7.2/lib/native org.apache.flume.node.Application --conf-file /usr/local/etc/hadoop/module/flume-1.7.0/conf/kafka-flume-hdfs.conf --name a1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/etc/hadoop/module/flume-1.7.0/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/etc/hadoop/module/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
20/04/09 11:12:48 ERROR node.Application: A fatal error occurred while running. Exception follows.
org.apache.commons.cli.ParseException: The specified configuration file does not exist: /usr/local/etc/hadoop/module/flume-1.7.0/conf/kafka-flume-hdfs.conf
        at org.apache.flume.node.Application.main(Application.java:316)

Still to be resolved!!!!
Problem encountered: no data ends up in HDFS.
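
Judging from the error above, the agent on hadoop3.x cannot find /usr/local/etc/hadoop/module/flume-1.7.0/conf/kafka-flume-hdfs.conf, which was only created on hadoop1.x in step (1). A minimal sketch of a likely fix (a suggestion, not from the original write-up), assuming passwordless SSH between the nodes as already used by f2.sh: copy the config to hadoop3.x, restart, and check HDFS.

// copy the config to the node that actually runs the consumer Flume
[root@hadoop1 bin]# scp /usr/local/etc/hadoop/module/flume-1.7.0/conf/kafka-flume-hdfs.conf root@hadoop3.x:/usr/local/etc/hadoop/module/flume-1.7.0/conf/
// restart the consumer Flume
[root@hadoop1 bin]# ./f2.sh stop
[root@hadoop1 bin]# ./f2.sh start
// data only appears here if something is still being produced to topic_start / topic_event
[root@hadoop1 bin]# hadoop fs -ls /origin_data/gmall/log/topic_start/$(date +%Y-%m-%d)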
