Flume is used to collect log data, with a Flume interceptor routing events to different Kafka topics; a second Flume agent then consumes those topics and sinks the data to HDFS with LZO compression. The following error appeared:
2020-01-30 22:04:11,378 (conf-file-poller-0) [WARN - org.apache.hadoop.util.NativeCodeLoader.<clinit>(NativeCodeLoader.java:62)]
Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2020-01-30 22:04:11,786 (conf-file-poller-0) [ERROR - org.apache.flume.node.AbstractConfigurationProvider.loadSinks(AbstractConfigurationProvider.java:426)]
Sink k1 has been removed due to an error during configuration
java.lang.IllegalArgumentException: Compression codec com.hadoop.compression.lzo.LzoCodec not found.
at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:139)
at org.apache.flume.sink.hdfs.HDFSEventSink.getCodec(HDFSEventSink.java:313)
at org.apache.flume.sink.hdfs.HDFSEventSink.configure(HDFSEventSink.java:237)
at org.apache.flume.conf.Configurables.configure(Configurables.java:41)
at org.apache.flume.node.AbstractConfigurationProvider.loadSinks(AbstractConfigurationProvider.java:411)
at org.apache.flume.node.AbstractConfigurationProvider.getConfiguration(AbstractConfigurationProvider.java:102)
at org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:141)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:132)
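The `ClassNotFoundException` at the bottom of the trace is the key symptom: no jar on Flume's classpath contains `com.hadoop.compression.lzo.LzoCodec`. A quick way to check is a small helper like the one below (the Flume lib path matches this post's install and is an assumption for your setup):

```shell
# Prints any hadoop-lzo jar found in the given directory; returns non-zero
# (ls fails on an unmatched glob) if none is present.
check_lzo_jar() {
  ls "$1"/hadoop-lzo-*.jar 2>/dev/null
}
# e.g. check_lzo_jar /opt/modules/apache-flume-1.7.0-bin/lib \
#        || echo "hadoop-lzo jar missing - LzoCodec cannot load"
```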
Troubleshooting pointed to a problem around the following Flume configuration:
## Components
a1.sources = r1 r2
a1.channels = c1 c2
a1.sinks = k1 k2
## source1
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.batchSize = 5000
a1.sources.r1.batchDurationMillis = 2000
a1.sources.r1.kafka.bootstrap.servers = bigdata01:9092,bigdata02:9092,bigdata03:9092
a1.sources.r1.kafka.topics=topic_start
## source2
a1.sources.r2.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r2.batchSize = 5000
a1.sources.r2.batchDurationMillis = 2000
a1.sources.r2.kafka.bootstrap.servers = bigdata01:9092,bigdata02:9092,bigdata03:9092
a1.sources.r2.kafka.topics=topic_event
## channel1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/modules/apache-flume-1.7.0-bin/checkpoint/behavior1
a1.channels.c1.dataDirs = /opt/modules/apache-flume-1.7.0-bin/data/behavior1/
a1.channels.c1.maxFileSize = 2146435071
a1.channels.c1.capacity = 1000000
a1.channels.c1.keep-alive = 6
## channel2
a1.channels.c2.type = file
a1.channels.c2.checkpointDir = /opt/modules/apache-flume-1.7.0-bin/checkpoint/behavior2
a1.channels.c2.dataDirs = /opt/modules/apache-flume-1.7.0-bin/data/behavior2/
a1.channels.c2.maxFileSize = 2146435071
a1.channels.c2.capacity = 1000000
a1.channels.c2.keep-alive = 6
## sink1
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://bigdata01:8020/origin_data/gmall/log/topic_start/%Y-%m-%d
a1.sinks.k1.hdfs.filePrefix = logstart-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = second
## sink2
a1.sinks.k2.type = hdfs
a1.sinks.k2.hdfs.path = hdfs://bigdata01:8020/origin_data/gmall/log/topic_event/%Y-%m-%d
a1.sinks.k2.hdfs.filePrefix = logevent-
a1.sinks.k2.hdfs.round = true
a1.sinks.k2.hdfs.roundValue = 10
a1.sinks.k2.hdfs.roundUnit = second
## Avoid producing large numbers of small files
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k2.hdfs.rollInterval = 10
a1.sinks.k2.hdfs.rollSize = 134217728
a1.sinks.k2.hdfs.rollCount = 0
## Write the output files as compressed streams
a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k2.hdfs.fileType = CompressedStream
# The codec cannot be specified simply as "lzop"; the fully-qualified class name is required
a1.sinks.k1.hdfs.codeC = com.hadoop.compression.lzo.LzopCodec
#-------
a1.sinks.k1.hdfs.writeFormat=Text
a1.sinks.k1.hdfs.callTimeout=360000
a1.sinks.k1.hdfs.maxIoWorkers=32
a1.sinks.k1.hdfs.fileSuffix=.lzo
a1.sinks.k2.hdfs.codeC = com.hadoop.compression.lzo.LzopCodec
#-------
a1.sinks.k2.hdfs.writeFormat=Text
a1.sinks.k2.hdfs.callTimeout=360000
a1.sinks.k2.hdfs.maxIoWorkers=32
a1.sinks.k2.hdfs.fileSuffix=.lzo
## Wire sources and sinks to channels
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sources.r2.channels = c2
a1.sinks.k2.channel = c2
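As a side note on the roll settings in the configuration above, the `rollSize` of 134217728 bytes is exactly one 128 MiB block, the default HDFS block size in Hadoop 2.x, so each rolled file fills a single block:

```shell
# rollSize above is one 128 MiB HDFS block:
echo $((128 * 1024 * 1024))   # → 134217728
```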
Install the lzop tool:
# yum install lzop
# yum -y install lzo-devel zlib-devel gcc autoconf automake libtool
Then copy the hadoop-lzo jar into Flume's lib directory, and also place it under the following Hadoop path:
[johnny@bigdata01 ~]$ cd /opt/modules/hadoop-2.7.2/share/hadoop/common/
[johnny@bigdata01 common]$ ls
hadoop-common-2.7.2.jar hadoop-lzo-0.4.20.jar jdiff sources
hadoop-common-2.7.2-tests.jar hadoop-nfs-2.7.2.jar lib templates
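The copy step can be sketched as a small helper; the source and destination directories in the example call mirror this post's layout and are assumptions for your cluster:

```shell
# Copy the hadoop-lzo jar from a Hadoop common dir ($1) into a Flume lib dir ($2),
# putting the codec class on Flume's classpath.
install_lzo_jar() {
  cp "$1"/hadoop-lzo-*.jar "$2"/
}
# e.g. install_lzo_jar /opt/modules/hadoop-2.7.2/share/hadoop/common \
#                      /opt/modules/apache-flume-1.7.0-bin/lib
```

The same jar also needs to reach the other machines, for example with `scp` or `rsync` to bigdata02 and bigdata03.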
Add the following configuration to core-site.xml:
<property>
    <name>io.compression.codecs</name>
    <value>
        org.apache.hadoop.io.compress.GzipCodec,
        org.apache.hadoop.io.compress.DefaultCodec,
        org.apache.hadoop.io.compress.BZip2Codec,
        org.apache.hadoop.io.compress.SnappyCodec,
        com.hadoop.compression.lzo.LzoCodec,
        com.hadoop.compression.lzo.LzopCodec
    </value>
</property>
<property>
    <name>io.compression.codec.lzo.class</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
Be sure to copy Hadoop's core-site.xml into Flume's conf directory, and sync it to the other machines as well.
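Copying core-site.xml into Flume's conf directory can be sketched as below; the directories in the example call follow this post's layout and are assumptions for your cluster:

```shell
# Copy core-site.xml from a Hadoop conf dir ($1) into a Flume conf dir ($2)
# so the HDFS sink sees the codec configuration.
sync_core_site() {
  cp "$1"/core-site.xml "$2"/
}
# e.g. sync_core_site /opt/modules/hadoop-2.7.2/etc/hadoop \
#                     /opt/modules/apache-flume-1.7.0-bin/conf
# then distribute to the other agents, for example:
# rsync -av /opt/modules/apache-flume-1.7.0-bin/conf/ \
#       bigdata02:/opt/modules/apache-flume-1.7.0-bin/conf/
```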
Add the following to /etc/profile:
#JAVA_HOME=/opt/modules/jdk1.7.0_79
export JAVA_HOME=/opt/modules/jdk1.8.0_231-amd64
export PATH=$PATH:$JAVA_HOME/bin
export MAVEN_HOME=/opt/modules/apache-maven-3.0.5
export PATH=$PATH:$MAVEN_HOME/bin
export LD_LIBRARY_PATH=/opt/modules/hadoop-2.7.2/lib/native
export CLASSPATH=$CLASSPATH:/opt/modules/hadoop-2.7.2/share/hadoop/common
#KAFKA_HOME
export KAFKA_HOME=/opt/modules/kafka_2.11-0.11.0.0/
export PATH=$PATH:$KAFKA_HOME/bin
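After reloading the profile with `source /etc/profile`, one way to spot-check that `LD_LIBRARY_PATH` picked up the native-library directory is a small helper like this (the directory in the example is this post's install path, an assumption):

```shell
# Check whether a colon-separated PATH-style list ($1) contains a directory ($2).
contains_dir() {
  case ":$1:" in
    *":$2:"*) return 0 ;;  # directory present in the list
    *)        return 1 ;;
  esac
}
# e.g. after `source /etc/profile`:
# contains_dir "$LD_LIBRARY_PATH" /opt/modules/hadoop-2.7.2/lib/native && echo ok
```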
Running Flume after these changes, the codec error was gone, but the following empty-value problem appeared:
2020-01-30 22:53:54,297 (SinkRunner-PollingRunner-DefaultSinkProcessor)
[ERROR - org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:447)] process failed
java.lang.IllegalStateException: Empty value [channel=[channel=c2]]
at com.google.common.base.Preconditions.checkState(Preconditions.java:145)
at org.apache.flume.channel.file.FlumeEventQueue.removeHead(FlumeEventQueue.java:159)
at org.apache.flume.channel.file.FileChannel$FileBackedTransaction.doTake(FileChannel.java:520)
at org.apache.flume.channel.BasicTransactionSemantics.take(BasicTransactionSemantics.java:113)
at org.apache.flume.channel.BasicChannelSemantics.take(BasicChannelSemantics.java:95)
at org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:362)
at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:67)
at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:145)
at java.lang.Thread.run(Thread.java:748)
2020-01-30 22:53:54,316 (SinkRunner-PollingRunner-DefaultSinkProcessor) [ERROR - org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:158)] Unable to deliver event. Exception follows.
The cause was the startup order: the log-generating program has to be started first, and only then the second Flume agent (flume2). Starting them in that order resolved the problem, and the data on HDFS was then verified to be correct.