A Flume Solution for Resumable Collection (断点续传)

The source here monitors file content updates with exec 'tail -F file'. This runs into a problem: if the Flume agent goes down while the monitored file keeps being appended to, the data written during the outage is lost once Flume restarts. Below is how I solved this, improving step by step from V1 -> V2 -> V3 without modifying the Flume source code.

Version 1 (Exec Source with tail -F)

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
# Define the exec source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /inputs/flume_monitor/test


# Describe the sink
# Sink to Kafka
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.topic = test_topic
a1.sinks.k1.brokerList= bigdata112:9092,bigdata113:9092,bigdata114:9092
a1.sinks.k1.batchSize = 20
a1.sinks.k1.requiredAcks = 1
a1.sinks.k1.kafka.producer.linger.ms = 1


# Use a channel which buffers events in memory
# Memory channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
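
The agent above can be launched with the standard flume-ng command; the configuration file name flume-exec-kafka.conf below is only a placeholder for wherever the settings above are saved:

# Launch agent a1 with the configuration above (conf file path/name are placeholders)
bin/flume-ng agent \
  --conf conf \
  --conf-file conf/flume-exec-kafka.conf \
  --name a1 \
  -Dflume.root.logger=INFO,console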

tail -F file is equivalent to tail --follow=name --retry: it follows the file by name and keeps retrying, so if the file is deleted or renamed and a file with the same name is created again later, tailing continues on the new file.
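
This retry-by-name behavior can be checked with a throwaway file; the path /tmp/tailf_demo.log below is arbitrary:

# tail -F keeps following by name and survives delete/recreate
echo "line 1" > /tmp/tailf_demo.log
tail -F /tmp/tailf_demo.log &           # follows by name; retries if the file disappears
TAIL_PID=$!
sleep 1
rm /tmp/tailf_demo.log                  # tail reports the file as inaccessible but keeps waiting
echo "line 2" > /tmp/tailf_demo.log     # recreate a file with the same name
sleep 1                                 # the running tail picks up "line 2" from the new file
kill "$TAIL_PID"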

However, tail -F keeps no record of the offset Flume has already shipped, so if new content is appended to the file while Flume is down, that content is lost. To solve this, Flume 1.7.0 introduced the Taildir Source, which supports resuming from where it left off.

Version 2 (Taildir Source)

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# https://blog.csdn.net/Abysscarry/article/details/89420560
# Describe/configure the source
a1.sources.r1.type = TAILDIR
# File in which the collected offset of each tailed file is recorded
a1.sources.r1.positionFile =  /opt/module/flume-1.8.0/log/taildir_position.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /inputs/flume_monitor/test
a1.sources.r1.fileHeader = true

# Describe the sink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.topic = test_topic
a1.sinks.k1.brokerList= bigdata112:9092,bigdata113:9092,bigdata114:9092
a1.sinks.k1.batchSize = 20
a1.sinks.k1.requiredAcks = 1
a1.sinks.k1.kafka.producer.linger.ms = 1


# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

So what is the Taildir Source? The official documentation defines it as follows:

Watch the specified files, and tail them in nearly real-time once detected new lines appended to the each files. If the new lines are being written, this source will retry reading them in wait for the completion of the write.

This source is reliable and will not miss data even when the tailing files rotate. It periodically writes the last read position of each files on the given position file in JSON format. If Flume is stopped or down for some reason, it can restart tailing from the position written on the existing position file.

In other use case, this source can also start tailing from the arbitrary position for each files using the given position file. When there is no position file on the specified path, it will start tailing from the first line of each files by default.

Files will be consumed in order of their modification time. File with the oldest modification time will be consumed first.

This source does not rename or delete or do any modifications to the file being tailed. Currently this source does not support tailing binary files. It reads text files line by line.

When Flume starts tailing a file, the offset that has already been collected is saved in the file configured as a1.sources.r1.positionFile.

That file looks roughly like [{"inode":2281677,"pos":92187,"file":"/inputs/flume_monitor/gps"}], where inode is the file's unique identifier on Linux, pos is the collected offset, and file is the path of the tailed file.
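
As a sanity check, the inode and pos recorded in the position file can be compared with what stat reports for the tailed file; the paths below follow the configuration above:

# Inspect the Taildir position file and compare with the file's real inode and size
cat /opt/module/flume-1.8.0/log/taildir_position.json
stat -c 'inode=%i size=%s' /inputs/flume_monitor/test
# pos should not exceed the current file size (unless the file was truncated);
# once the sink catches up, pos and size match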

Although this solves much of the resume problem, experiments showed that Flume still loses some data when the channel is a memory channel: events buffered in memory disappear if the agent dies before the sink delivers them. To get a truly reliable Flume pipeline, Version 3 stores the channel data on disk.

Version 3 (File Channel: store channel data on disk)

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# https://blog.csdn.net/Abysscarry/article/details/89420560
# Describe/configure the source
a1.sources.r1.type = TAILDIR
# File in which the collected offset of each tailed file is recorded
a1.sources.r1.positionFile =  /opt/module/flume-1.8.0/log/taildir_position.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /inputs/flume_monitor/test
a1.sources.r1.fileHeader = true

# Describe the sink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.topic = test_topic
a1.sinks.k1.brokerList= bigdata112:9092,bigdata113:9092,bigdata114:9092
a1.sinks.k1.batchSize = 20
a1.sinks.k1.requiredAcks = 1
a1.sinks.k1.kafka.producer.linger.ms = 1

# Use a channel which buffers events on disk (file channel)
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /tmp/flume/checkpoint
a1.channels.c1.dataDirs = /tmp/flume/data
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
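
With the file channel, events that the Kafka sink has not delivered yet survive an agent restart: they are written to the data directory, and a periodic checkpoint of the channel state is kept in the checkpoint directory. Both can be inspected while the agent runs (directories as configured above):

ls -lh /tmp/flume/checkpoint   # channel checkpoint, refreshed periodically
ls -lh /tmp/flume/data         # log files holding the channel's events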

Test Code

Data Producer

#!/bin/bash

outputpath=/inputs/flume_monitor/gps

echo "outputpath:"$outputpath
for i in {1..1000}
do
   echo "Welcome $i times"
   echo "Welcome $i times" >> $outputpath
   sleep 0.5
done

Data Consumer (Kafka)

import kafka.utils.ShutdownableThread;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.log4j.Logger;

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.PrintWriter;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class Consumer extends ShutdownableThread {
    private Logger LOG = Logger.getLogger(this.getClass());
    private final KafkaConsumer<String, String> consumer;
    private final String topic;

    private FileOutputStream file = null;
    private BufferedOutputStream out = null;
    private PrintWriter printWriter = null;
    private String lineSeparator = null;
    private int batchNum = 0;


    public Consumer(String topic, String groupId) {
        super("KafkaConsumerExample", false);
        Properties props = new Properties();

        // Change this to your own Kafka broker addresses
        props.put("bootstrap.servers", "bigdata112:9092,bigdata113:9092,bigdata114:9092");
        props.put("group.id", groupId);
        props.put("enable.auto.commit", "true");
        props.put("auto.offset.reset", "earliest");
        props.put("session.timeout.ms", "30000");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        consumer = new KafkaConsumer<>(props);
        System.out.println("consumer:"+consumer);
        this.topic = topic;
    }

    @Override
    public void doWork() {

        consumer.subscribe(Collections.singletonList(this.topic));
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
        System.out.println("Messages consumed in this poll: " + records.count());
        if (records.count() > 0) {
            for (ConsumerRecord<String, String> record : records) {
                LOG.warn("Received message: (" + record.key() + ", " + record.value() + ") at offset " + record.offset());
                String value = record.value();
            }
        }
    }

    /**
     * Write a consumed message to the log file.
     *
     * @param msg message consumed from Kafka
     */
    protected void writeLog(String msg) {
        printWriter.write(msg + lineSeparator);
    }

    @Override
    public String name() {
        return null;
    }

    @Override
    public boolean isInterruptible() {
        return false;
    }

    public static void main(String[] args) {
        // Kafka topic
        String topic = "test_topic";
        // consumer group id
        String groupId = "test001";

        Consumer consumer = new Consumer(topic, groupId);
        consumer.start();
    }
}
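
A rough sketch of the crash test behind the comparison above; producer.sh stands for the producer script saved to a file, the conf file name is a placeholder, and the pgrep pattern assumes only one Flume agent JVM is running:

# 1. Start the agent with the Version 3 configuration (conf file name is a placeholder)
bin/flume-ng agent --conf conf --conf-file conf/flume-taildir-filechannel.conf --name a1 &

# 2. Start appending lines to the monitored file
#    (the producer's outputpath must point at the file configured in filegroups.f1)
bash producer.sh &

# 3. Simulate a crash while the producer is still writing
#    (kills the Flume JVM; assumes it is the only process running this main class)
kill -9 $(pgrep -f org.apache.flume.node.Application)

# 4. Restart the agent: Taildir resumes from taildir_position.json and the
#    file channel replays events that had not yet reached Kafka
bin/flume-ng agent --conf conf --conf-file conf/flume-taildir-filechannel.conf --name a1 &

# 5. Compare counts: lines in the monitored file vs. messages the Java
#    consumer above reads back from test_topic
wc -l /inputs/flume_monitor/test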