typora-root-url: images
Connecting the components
Logs --> Flume --> Kafka --> Flume --> HDFS
Generating data
Startup logs
Event logs
- Product list
- Product click
- Product detail
- Comment
- Like
- Favorite
- Background user activity
- Notification
- Advertisement
- Error
The data-generator jar
Link: https://pan.baidu.com/s/1bsaagUX2xACH1o7vB3b-Hw
Extraction code: nz6d
Upload the jar to /opt/module/sc_datas
[temp@hadoop102 /]$ mkdir -p /opt/module/sc_datas
[temp@hadoop102 /]$ cd /opt/module/sc_datas/
[temp@hadoop102 sc_datas]$ ll
total 1208
-rw-rw-r--. 1 temp temp 1233590 Apr  1 17:49 log-create.jar
[temp@hadoop102 sc_datas]$
Generate 100 records
# java -jar log-create.jar <delay-after-each-record-ms> <number-of-records>
[temp@hadoop102 sc_datas]$ java -jar log-create.jar 0 100
By default the generated data is saved under /tmp/logs/
[temp@hadoop102 tmp]$ cd logs/
[temp@hadoop102 logs]$ ll
total 64
-rw-rw-r--. 1 temp temp 62689 Apr  1 17:53 app-2021-04-01.log
[temp@hadoop102 logs]$ pwd
/tmp/logs
[temp@hadoop102 logs]$
Collecting the data with Flume
Use hadoop102 and hadoop103 to collect the logs under /tmp/logs/
source: TAILDIR (resumes from a position file after a failure, can monitor multiple directories)
channel: KafkaChannel (the downstream component is Kafka)
Interceptors: an ETL data-cleaning interceptor and a log-type interceptor
Link: https://pan.baidu.com/s/1QZjCldEqu82a2X6znpx6Gw
Extraction code: h99u
Upload flume-interceptor-1.0-SNAPSHOT.jar to the /opt/module/flume/lib directory
[temp@hadoop102 lib]$ pwd
/opt/module/flume/lib
[temp@hadoop102 lib]$ ls | grep "flume-interceptor"
flume-interceptor-1.0-SNAPSHOT.jar
[temp@hadoop102 lib]$
Create the file-flume-kafka.conf configuration file under /opt/module/flume/conf
[temp@hadoop102 conf]$ pwd
/opt/module/flume/conf
[temp@hadoop102 conf]$ vim file-flume-kafka.conf
# add the following
a1.sources=r1
a1.channels=c1 c2
#source
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /opt/module/flume/checkpoint/log_position.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /tmp/logs/app.+
a1.sources.r1.fileHeader = true
a1.sources.r1.channels = c1 c2
#interceptor
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = com.at.flume.interceptor.LogETLInterceptor$Builder
a1.sources.r1.interceptors.i2.type = com.at.flume.interceptor.LogTypeInterceptor$Builder
#selector
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = topic
a1.sources.r1.selector.mapping.topic_start = c1
a1.sources.r1.selector.mapping.topic_event = c2
#KafkaChannel1
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers = hadoop102:9092,hadoop103:9092,hadoop104:9092
a1.channels.c1.kafka.topic = topic_start
a1.channels.c1.parseAsFlumeEvent = false
a1.channels.c1.kafka.consumer.group.id = flume-consumer
#KafkaChannel2
a1.channels.c2.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c2.kafka.bootstrap.servers = hadoop102:9092,hadoop103:9092,hadoop104:9092
a1.channels.c2.kafka.topic = topic_event
a1.channels.c2.parseAsFlumeEvent = false
a1.channels.c2.kafka.consumer.group.id = flume-consumer
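If topic auto-creation is disabled on the brokers, the two topics the selector routes to must exist before the agent starts. A minimal sketch using Kafka's bundled CLI (the install path, partition count, and replication factor are assumptions; newer Kafka versions take --bootstrap-server hadoop102:9092 instead of --zookeeper):
# create the two topics up front (skip if auto-creation is enabled)
/opt/module/kafka/bin/kafka-topics.sh --zookeeper hadoop102:2181 --create --topic topic_start --partitions 3 --replication-factor 2
/opt/module/kafka/bin/kafka-topics.sh --zookeeper hadoop102:2181 --create --topic topic_event --partitions 3 --replication-factor 2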
Sync to the other nodes
[temp@hadoop102 conf]$ xsync /opt/module/flume/lib/
[temp@hadoop102 conf]$ xsync /opt/module/flume/conf/
Flume -> HDFS
This only needs to be deployed on hadoop104
Two agents: one for topic_event (a2) and one for topic_start (a1)
[temp@hadoop104 conf]$ pwd
/opt/module/flume/conf
[temp@hadoop104 conf]$ vim eventkafka-flume-hdfs.conf
# add the following
a2.sources=r2
a2.channels=c2
a2.sinks=k2
# source
a2.sources.r2.type = org.apache.flume.source.kafka.KafkaSource
a2.sources.r2.batchSize = 5000
a2.sources.r2.batchDurationMillis = 2000
a2.sources.r2.kafka.bootstrap.servers = hadoop102:9092,hadoop103:9092,hadoop104:9092
a2.sources.r2.kafka.topics=topic_event
# channel
a2.channels.c2.type = file
a2.channels.c2.checkpointDir = /opt/module/flume/checkpoint/behavior2
a2.channels.c2.dataDirs = /opt/module/flume/data/behavior2/
a2.channels.c2.maxFileSize = 2146435071
a2.channels.c2.capacity = 1000000
a2.channels.c2.keep-alive = 6
# sink
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path = /origin_data/mall/log/topic_event/%Y-%m-%d
a2.sinks.k2.hdfs.filePrefix = logevent-
a2.sources.r2.channels = c2
a2.sinks.k2.channel= c2
[temp@hadoop104 conf]$ vim startkafka-flume-hdfs.conf
# add the following
a1.sources=r1
a1.channels=c1
a1.sinks=k1
# source
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.batchSize = 5000
a1.sources.r1.batchDurationMillis = 2000
a1.sources.r1.kafka.bootstrap.servers = hadoop102:9092,hadoop103:9092,hadoop104:9092
a1.sources.r1.kafka.topics=topic_start
# channel
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/module/flume/checkpoint/behavior1
a1.channels.c1.dataDirs = /opt/module/flume/data/behavior1/
a1.channels.c1.maxFileSize = 2146435071
a1.channels.c1.capacity = 1000000
a1.channels.c1.keep-alive = 6
# sink1
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /origin_data/mall/log/topic_start/%Y-%m-%d
a1.sinks.k1.hdfs.filePrefix = logstart-
a1.sources.r1.channels = c1
a1.sinks.k1.channel= c1
Start-up test
Start ZooKeeper
[temp@hadoop102 flume]$ zk.sh start
Start Kafka
[temp@hadoop102 flume]$ kf.sh start
Start kafka-eagle
/opt/module/kafka-eagle-web-1.3.7/bin/ke.sh start
http://192.168.170.102:8048/ke/
Start the Flume collection agents on hadoop102 and hadoop103
/opt/module/flume/bin/flume-ng agent --conf-file /opt/module/flume/conf/file-flume-kafka.conf --name a1 -Dflume.root.logger=INFO,LOGFILE
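To confirm that events are actually reaching Kafka, a console consumer can be attached to one of the topics (a quick check; the Kafka install path is an assumption):
# read events from topic_start as they arrive; Ctrl-C to stop
/opt/module/kafka/bin/kafka-console-consumer.sh --bootstrap-server hadoop102:9092 --topic topic_start --from-beginning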
[temp@hadoop102 logs]$ vim /home/temp/bin/f1.sh
#! /bin/bash
case $1 in
"start"){
        for i in hadoop102 hadoop103
        do
                echo " -------- starting collection Flume on $i --------"
                # redirect to a file inside the logs directory, not to the directory itself
                ssh $i "nohup /opt/module/flume/bin/flume-ng agent --conf-file /opt/module/flume/conf/file-flume-kafka.conf --name a1 -Dflume.root.logger=INFO,LOGFILE >/opt/module/flume/logs/flume-collect.log 2>&1 &"
        done
};;
"stop"){
        for i in hadoop102 hadoop103
        do
                echo " -------- stopping collection Flume on $i --------"
                ssh $i "ps -ef | grep file-flume-kafka | grep -v grep | awk '{print \$2}' | xargs -n1 kill -9"
        done
};;
esac
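Make the script executable and use it to start or stop both collection agents at once:
[temp@hadoop102 logs]$ chmod +x /home/temp/bin/f1.sh
[temp@hadoop102 logs]$ f1.sh start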
Start Ganglia
sudo service httpd start
sudo service gmetad start
sudo service gmond start
http://192.168.170.102/ganglia
Start HDFS and YARN
[temp@hadoop102 module]$ start-dfs.sh
[temp@hadoop102 module]$ start-yarn.sh
Start the Flume consumer agents on hadoop104
/opt/module/flume/bin/flume-ng agent --conf-file /opt/module/flume/conf/eventkafka-flume-hdfs.conf --name a2 -Dflume.root.logger=INFO,LOGFILE
/opt/module/flume/bin/flume-ng agent --conf-file /opt/module/flume/conf/startkafka-flume-hdfs.conf --name a1 -Dflume.root.logger=INFO,LOGFILE
[temp@hadoop102 logs]$ vim /home/temp/bin/f2.sh
#!/bin/bash
case $1 in
"start"){
        echo "=============== 104 flume start ================="
        ssh hadoop104 "nohup /opt/module/flume/bin/flume-ng agent --conf-file /opt/module/flume/conf/startkafka-flume-hdfs.conf --name a1 -Dflume.root.logger=INFO,LOGFILE > /opt/module/flume/logs/flume-1.log 2>&1 & "
        ssh hadoop104 "nohup /opt/module/flume/bin/flume-ng agent --conf-file /opt/module/flume/conf/eventkafka-flume-hdfs.conf --name a2 -Dflume.root.logger=INFO,LOGFILE > /opt/module/flume/logs/flume-2.log 2>&1 & "
};;
"stop"){
        echo "=============== 104 flume stop ================="
        ssh hadoop104 "ps -ef | grep startkafka-flume-hdfs | grep -v grep | awk '{print \$2}' | xargs -n1 kill"
        ssh hadoop104 "ps -ef | grep eventkafka-flume-hdfs | grep -v grep | awk '{print \$2}' | xargs -n1 kill"
};;
esac
[temp@hadoop102 logs]$ xcall.sh jps
================ hadoop102 ===================
36113 QuorumPeerMain
36369 NameNode
36755 JournalNode
43123 Bootstrap
43478 Jps
36504 DataNode
37401 ResourceManager
36986 DFSZKFailoverController
38074 Application
38589 Kafka
37550 NodeManager
================ hadoop103 ===================
44307 QuorumPeerMain
44467 NameNode
45558 Application
44568 DataNode
44968 ResourceManager
46088 Kafka
45081 NodeManager
44827 DFSZKFailoverController
54044 Jps
44685 JournalNode
================ hadoop104 ===================
38401 Kafka
44754 Application
50212 Jps
37271 NameNode
37370 DataNode
40218 ConsoleConsumer
37484 JournalNode
37628 DFSZKFailoverController
37758 NodeManager
48206 Application
37135 QuorumPeerMain
[temp@hadoop102 logs]$
Log-generation script
[temp@hadoop102 logs]$ vim /home/temp/bin/lg.sh
#! /bin/bash
for i in hadoop102 hadoop103
do
        ssh $i "java -jar /opt/module/sc_datas/log-create.jar $1 $2 >/dev/null 2>&1 &"
done
Run the log script; the two arguments (per-record delay and record count) are passed straight through to the jar:
lg.sh 0 100
The hadoop104 Flume agents fail with:
Exception in thread "SinkRunner-PollingRunner-DefaultSinkProcessor" java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
at org.apache.hadoop.conf.Configuration.set(Configuration.java:1357)
at org.apache.hadoop.conf.Configuration.set(Configuration.java:1338)
at org.apache.hadoop.conf.Configuration.setBoolean(Configuration.java:1679)
at org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:221)
at org.apache.flume.sink.hdfs.BucketWriter.append(BucketWriter.java:572)
at org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:412)
at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:67)
at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:145)
at java.lang.Thread.run(Thread.java:748)
Fix: delete guava-11.0.2.jar from Flume's lib folder for compatibility with Hadoop 3.1.3, which brings its own, newer Guava onto the classpath
rm /opt/module/flume/lib/guava-11.0.2.jar
Restart the Flume consumers on hadoop104
Check the result at http://hadoop102:9870/explorer.html#/
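The same check can be done from the command line; the path follows the HDFS sink configuration above:
# list everything Flume has written so far
hdfs dfs -ls -R /origin_data/mall/log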
There are far too many small files. Ways to deal with this:
- har archiving
- have Flume roll the files it uploads by time, size, or event count
- compression
Do not produce lots of small files; add the following to the two consumer configs on hadoop104:
# startkafka-flume-hdfs.conf
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
# make the output a compressed stream instead of a native file
a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC = lzop
# eventkafka-flume-hdfs.conf
a2.sinks.k2.hdfs.rollInterval = 10
a2.sinks.k2.hdfs.rollSize = 134217728
a2.sinks.k2.hdfs.rollCount = 0
# make the output a compressed stream instead of a native file
a2.sinks.k2.hdfs.fileType = CompressedStream
a2.sinks.k2.hdfs.codeC = lzop
Configure Hadoop LZO compression (in preparation for the business data)
[temp@hadoop102 logs]$ vim /opt/module/hadoop-3.1.3/etc/hadoop/core-site.xml
# add the following
<property>
<name>io.compression.codecs</name>
<value>
org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.DefaultCodec,
org.apache.hadoop.io.compress.BZip2Codec,
org.apache.hadoop.io.compress.SnappyCodec,
com.hadoop.compression.lzo.LzoCodec,
com.hadoop.compression.lzo.LzopCodec
</value>
</property>
<property>
<name>io.compression.codec.lzo.class</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
[temp@hadoop102 logs]$ cp /opt/software/hadoop-lzo-0.4.20.jar /opt/module/hadoop-3.1.3/share/hadoop/common/
[temp@hadoop102 logs]$ xsync /opt/module/hadoop-3.1.3/share/hadoop/common/hadoop-lzo-0.4.20.jar
[temp@hadoop102 logs]$ xsync /opt/module/hadoop-3.1.3/etc/hadoop/
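A quick smoke test of the codec, as a sketch: run the bundled wordcount example with lzop output (the /input directory holding a small text file is an assumption):
# the result should appear as part-r-00000.lzo under /output
hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount -Dmapreduce.output.fileoutputformat.compress=true -Dmapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec /input /output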
Cluster start/stop script
[temp@hadoop102 bin]$ vim cluster.sh
#!/bin/bash
case $1 in
"start"){
        echo "================ cluster start ================"
        # ZooKeeper
        zk.sh start
        sleep 2s;
        # HDFS and YARN
        /opt/module/hadoop-3.1.3/sbin/start-dfs.sh
        /opt/module/hadoop-3.1.3/sbin/start-yarn.sh
        # collection Flume on 102 and 103
        f1.sh start
        # Kafka
        kf.sh start
        sleep 3s;
        # consumer Flume on 104
        f2.sh start
};;
"stop"){
        echo "================ cluster stop ================"
        f2.sh stop
        kf.sh stop
        sleep 5s;
        f1.sh stop
        /opt/module/hadoop-3.1.3/sbin/stop-yarn.sh
        /opt/module/hadoop-3.1.3/sbin/stop-dfs.sh
        zk.sh stop
};;
esac
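Usage, assuming the script directory is on the PATH:
[temp@hadoop102 bin]$ chmod +x cluster.sh
[temp@hadoop102 bin]$ cluster.sh start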
At this point the logfile -> Flume -> Kafka -> Flume -> HDFS channel is fully working.
MySQL -> HDFS
Sqoop
http://mirrors.hust.edu.cn/apache/sqoop/1.4.6/
Unpack
[temp@hadoop102 software]$ tar -zxvf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz -C /opt/module/
[temp@hadoop102 module]$ mv sqoop-1.4.6.bin__hadoop-2.0.4-alpha/ sqoop
Edit the configuration file
[temp@hadoop102 conf]$ pwd
/opt/module/sqoop/conf
[temp@hadoop102 conf]$ mv sqoop-env-template.sh sqoop-env.sh
[temp@hadoop102 conf]$ vim sqoop-env.sh
# add the following
export HADOOP_COMMON_HOME=/opt/module/hadoop-3.1.3
export HADOOP_MAPRED_HOME=/opt/module/hadoop-3.1.3
export ZOOKEEPER_HOME=/opt/module/zookeeper
export ZOOCFGDIR=/opt/module/zookeeper/conf
Add the MySQL JDBC driver to /opt/module/sqoop/lib/
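For example (the connector jar name and version are assumptions; use whichever driver matches your MySQL server):
[temp@hadoop102 software]$ cp mysql-connector-java-5.1.27-bin.jar /opt/module/sqoop/lib/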
Verify the database connection
[temp@hadoop102 sqoop]$ bin/sqoop list-databases --connect jdbc:mysql://hadoop102:3306/ --username root --password root
# on success this lists the databases in MySQL
Create the tables
sc_mall_db.sql
Import the data
mysql_to_hdfs.sh
Link: https://pan.baidu.com/s/1dTWx2lEWRYvpsrXi1W2aDQ
Extraction code: gupr
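The table script and import script come from the link above. As a rough sketch of the kind of per-table import mysql_to_hdfs.sh presumably wraps (the database name sc_mall and table order_info are assumptions for illustration):
# import one MySQL table into the lzop-compressed staging area on HDFS
/opt/module/sqoop/bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/sc_mall \
--username root --password root \
--table order_info \
--target-dir /origin_data/sc_mall/db/order_info/2021-04-01 \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by '\t' \
--compress \
--compression-codec lzop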