Flume Basics

小萨_Joshua

Tags: flume, big data

Article link: https://blog.csdn.net/weixin_45283159/article/details/136450854

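Flume is a Java framework, so everything it does runs on the JVM (worth reading up on JVM processes and threads when time allows).
Starting any Java framework means starting a JVM process.
jps: the JDK's process-listing tool; it shows only JVM processes.

Flume is a distributed framework for collecting, aggregating, and transporting massive volumes of log data. It is designed for text logs (for example .log and error files).

Data sources: mainly (1) Java backend log data and (2) Python crawler data.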

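How uploading differs
(1) Regular upload:
An hdfs put can only upload a log file after it has been completely written.
Flume, by contrast, can watch a local disk or a network port in real time and deliver the data to HDFS or Kafka.
Classic HDFS problem: small files. Each small file is read separately by its own MapTask and also takes up NameNode (nn) metadata, so both waste resources.
(2) Flume
1. Flume can collect a batch of data and upload it to HDFS together.
2. HDFS files cannot be modified, but they can be appended to, for example appending until the file reaches 128 MB and then starting the next file.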

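Basic architecture
Agent (a JVM process)
1. source: the upstream side. Reading from disk is relatively fast; throughput depends on the disk's read speed, roughly tens to hundreds of MB per second. The source turns the data into events.
2. channel: connects upstream and downstream and also acts as a buffer. It has a fixed capacity, roughly like an ArrayList used as a queue.
3. sink: the downstream side. Writing locally is fine; writing to a remote server is limited by network bandwidth.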

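Channel:
1. If the source and the sink run at different speeds, the channel buffers the events. If the channel fills up, the source can be held back (for example, told to stop reading for a while).
2. A channel is thread-safe (threads queue on a lock: a arrives first and takes the lock, b waits).
3. The two common types are Memory Channel and File Channel.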
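
The two channel types can be declared side by side, as in the minimal sketch below. The directory paths are hypothetical placeholders, and the capacities are simply the values used in the official single-node example. A memory channel loses buffered events if the agent JVM dies, while a file channel keeps them on disk.

```properties
# Two channels on agent a1: one in memory, one backed by local files
a1.channels = c1 c2

# Memory channel: fast, but buffered events are lost if the JVM stops
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# File channel: slower, but events survive an agent restart
a1.channels.c2.type = file
# Hypothetical directories, adjust to your own machine
a1.channels.c2.checkpointDir = /opt/flume/checkpoint
a1.channels.c2.dataDirs = /opt/flume/data
```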

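An Event is simply a class; an Event object is created with new.
What travels through the agent is Event objects.
An Event has two properties:
Header: a map of (k, v) pairs
Body: a byte array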

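Flume installation directory:
It follows the standard layout of a Java framework.
bin: wrapped startup commands, roughly of the form java -jar xxx -D xxxxxx
conf: configuration files, where parameters and variables are passed in
(log4j controls log output. The root log level decides what is printed: warn prints only warnings and errors, info prints info-level messages and above. LOGFILE is the output destination; adding console also prints to the console.
log.dir defaults to a path relative to the directory where the command is run.)
lib: the bundled jar files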

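How an Agent works
Passing an event
1. Source receives the data.
2. Channel Processor handles the event --> 3. Interceptor chain.
4. Channel Selector: every event is passed to the selector, which returns the list of channels to write it to (Replicating copies each event to every channel; Multiplexing works together with interceptors and routes events flexibly based on the event header).
5. Based on the selector's result, the event is written to the corresponding channel(s) (the put is the actual transfer; everything before it is preparation).
6. SinkProcessor (of little use inside one agent -> each JVM stands alone). Sink groups come in three types: default (one-to-one); load_balance (sinks are polled in turn to balance the load); failover (high availability: sinks are ranked by priority, e.g. 5, 3, 2, and when the highest one fails the next takes over; once the failed sink comes back, traffic returns to it). One channel can feed multiple sinks, but multiple channels cannot feed one sink.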
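
A failover sink group with the priorities 5, 3, 2 mentioned above might be configured as in the sketch below. The sink names k1 to k3, the group name g1, and the maxpenalty value are illustrative assumptions; each sink would still need its own type and destination settings.

```properties
# Three sinks that all drain the same channel
a1.sinks = k1 k2 k3
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
a1.sinks.k3.channel = c1

# Group them and let the failover processor pick the active sink
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2 k3
a1.sinkgroups.g1.processor.type = failover
# Higher number = higher priority; k1 is used while it is healthy
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 3
a1.sinkgroups.g1.processor.priority.k3 = 2
# Upper bound (ms) on the back-off applied to a failed sink (assumed value)
a1.sinkgroups.g1.processor.maxpenalty = 10000
```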

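Flume transactions
There are two transactions: the put transaction (source to channel) and the take transaction (channel to sink).
Steps 1 to 4 as a whole count as one unit of work, i.e. a transaction: it completes only if every step succeeds, and any failure rolls it back.
The source pushes data, so its input needs to be re-readable: nc (netcat) input cannot be read again, a txt file can.
The sink pulls data.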

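AVRO: a lightweight, port-based RPC framework used to send and receive data between agents.
It is typically used for:
1. Simple chaining
2. Replicating and multiplexing
3. Load balancing and failover (with several machines, the total channel capacity also goes up)
4. Aggregation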
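
Simple chaining can be sketched as an Avro sink on one agent pointing at an Avro source on another. The host name hadoop102, the port 4141, and the agent name a2 are hypothetical; the two halves would normally live in separate configuration files on separate machines.

```properties
# Upstream agent a1: sends its events over Avro RPC
a1.sinks.k1.type = avro
# Hypothetical host and port of the downstream agent
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 4141
a1.sinks.k1.channel = c1

# Downstream agent a2 (its own config file): listens for Avro events
a2.sources.r1.type = avro
a2.sources.r1.bind = 0.0.0.0
a2.sources.r1.port = 4141
a2.sources.r1.channels = c1
```

For aggregation, several upstream agents would point their Avro sinks at the same downstream source.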

Miscellaneous:

Case 1
Listen on a port, collect the data arriving on it, and print it to the console.
Official documentation:
https://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html

# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

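The configuration file has five parts:
1. Declaring the components
a1 is the name of the current agent
a1.sources
a1.sinks
a1.channels
2. Source (IP and port)
type (a fixed keyword, here netcat), bind IP, port
3. sink
logger: writes events out as log lines
4. channel
memory: events are held in memory
1000: the channel holds up to 1000 events, i.e. 1000 records
100: a single transaction holds at most 100 events
5. Wiring source, channel, and sink together
A sink can take from only one channel
A channel can accept data from multiple sources
A channel can feed multiple sinks
The event body printed by the logger sink is shown in hexadecimal.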

Case 2
Monitor files in real time and write them to HDFS

a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /var/log/test1/example.log
a1.sources.r1.filegroups.f2 = /var/log/test2/.*log.*

Several directories can be monitored at once, one file group per path.

a1.sources.r1.positionFile = /var/log/flume/taildir_position.json

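Offsets: taildir_position.json
The JSON file stores objects made of key-value pairs and records the read offset of every monitored file.
Resuming after a stop: when Flume is shut down and started again, it continues from the offsets saved at the last shutdown.
Editing with vim works by deleting the old file and creating a new one, so the edited file looks like a new file to the source.
echo >> 1.txt appends to the existing file.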
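
Put together, a Taildir source for this case might look like the sketch below. It reuses the paths from the snippets above and assumes the source is named r1 and bound to a channel c1.

```properties
# Taildir source: tails files and remembers how far it has read
a1.sources.r1.type = TAILDIR
# JSON file holding the per-file read offsets (used to resume after a restart)
a1.sources.r1.positionFile = /var/log/flume/taildir_position.json
# Two file groups: one exact file, one regex over file names
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /var/log/test1/example.log
a1.sources.r1.filegroups.f2 = /var/log/test2/.*log.*
a1.sources.r1.channels = c1
```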

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
From the official user guide:
#hdfs.filePrefix	FlumeData	Name prefixed to files created by Flume in hdfs directory
#hdfs.fileSuffix	–	Suffix to append to file (eg .avro - NOTE: period is not automatically added)
#hdfs.inUsePrefix	–	Prefix that is used for temporal files that flume actively writes into
#hdfs.inUseSuffix	.tmp	Suffix that is used for temporal files that flume actively writes into

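filePrefix and fileSuffix set the prefix and suffix of the file names Flume creates.
fileSuffix applies to files that have finished being written.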

inUseSuffix marks files that are still being appended to; the default .tmp denotes a temporary file.
From the official user guide:
#hdfs.rollInterval	30	Number of seconds to wait before rolling current file (0 = never roll based on time interval)
#hdfs.rollSize	1024	File size to trigger roll, in bytes (0: never roll based on file size)
#hdfs.rollCount	10	Number of events written to file before it rolled (0 = never roll based on number of events)

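What triggers a roll to a new file (default --> suggested value):
30 seconds --> 1 hour, so that data is still delivered during quiet periods; the unit is seconds, i.e. 3600
or 1 KB --> 128 MB; the unit is bytes, i.e. 134217700 (just under 134217728, which is exactly 128 MiB)
or 10 events --> 0, i.e. not used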
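
The suggested values above translate into the following sink settings. This is a sketch that assumes the HDFS sink is named k1, as in the earlier snippet.

```properties
# Roll to a new file after one hour, even when traffic is low
a1.sinks.k1.hdfs.rollInterval = 3600
# ...or once the file is just under 128 MiB (value in bytes)
a1.sinks.k1.hdfs.rollSize = 134217700
# 0 disables rolling based on the number of events
a1.sinks.k1.hdfs.rollCount = 0
```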

From the official user guide:
#hdfs.codeC	–	Compression codec. one of following : gzip, bzip2, lzo, lzop, snappy
#hdfs.fileType	SequenceFile	File format: currently SequenceFile, DataStream or CompressedStream (1)DataStream will not compress output file and please don’t set codeC (2)CompressedStream

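Compression and file format: the default SequenceFile is a serialized Hadoop IO format. Change fileType to DataStream and the output becomes readable as plain text.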

From the official user guide:
#hdfs.useLocalTimeStamp	false	Use the local time (instead of the timestamp from the event header) while replacing the escape sequences.

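Use the local timestamp.
Set it to true so that the time escape sequences in hdfs.path can be resolved; otherwise each event needs a timestamp header.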
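
A sink that writes readable output into time-based directories could be sketched as follows. The path layout is an illustrative variation on the one shown earlier, and k1 and c1 are the names assumed throughout.

```properties
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
# %y-%m-%d and %H are filled in from a timestamp
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H
# Plain output instead of the default SequenceFile; do not set codeC with this
a1.sinks.k1.hdfs.fileType = DataStream
# Take the timestamp from the local clock rather than the event header
a1.sinks.k1.hdfs.useLocalTimeStamp = true
```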

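Common problem: files cannot be loaded at startup
Because of resuming, Flume reads the positionFile when it starts. If the file is empty (contains no JSON), startup reports an error; delete the file to fix it.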

Distribution settings
processor.type: failover (failover) or load_balance. For load_balance, processor.selector can be round_robin (take turns) or random.
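
A load-balancing sink group can be sketched like this. The group name g1 and the sinks k1 and k2 are assumptions, and both sinks must drain the same channel.

```properties
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
# Temporarily skip a sink that has just failed
a1.sinkgroups.g1.processor.backoff = true
# round_robin or random
a1.sinkgroups.g1.processor.selector = random
```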

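Custom Interceptor
For example, use the header (a HashMap of k-v pairs) to decide which channel an event is sent to
It can also filter data
The selector type defaults to replicating
Multiplexing Channel Selector in the configuration file: 1. set type to multiplexing 2. name the header field 3. map each value of that field to a channel; a default channel can be configured
Java note: the configuration must reference TypeInterceptor$Builder. The Builder is what actually creates the interceptor, and it is easy to forget.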
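
The three configuration steps, plus the $Builder reference, might look like the sketch below. The package name com.example.flume, the header key type, and its values letter and number are hypothetical; the custom interceptor class itself would have to set that header on each event.

```properties
a1.sources.r1.channels = c1 c2

# Register the custom interceptor; note the $Builder suffix
a1.sources.r1.interceptors = i1
# Hypothetical package name for the TypeInterceptor class
a1.sources.r1.interceptors.i1.type = com.example.flume.TypeInterceptor$Builder

# 1. Switch the selector from the default (replicating) to multiplexing
a1.sources.r1.selector.type = multiplexing
# 2. Header field the selector looks at (hypothetical key)
a1.sources.r1.selector.header = type
# 3. Map each header value to a channel, with a fallback
a1.sources.r1.selector.mapping.letter = c1
a1.sources.r1.selector.mapping.number = c2
a1.sources.r1.selector.default = c1
```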
