Why not have clients write their data directly to HDFS?
Doing unified data filtering and processing on the server side is more convenient (and more standardized). When the number of clients is large, a Kafka queue can also be inserted in the middle for peak shaving; the server then consumes the data from Kafka and writes it to HDFS.
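A minimal sketch of that Kafka-buffered variant: Flume (1.7+) ships a KafkaChannel, so the server-side agent can consume events that clients publish to a Kafka topic and hand them to the HDFS sink. The broker address, topic, and group id below are placeholder assumptions, not values from this cluster:

```
# Hypothetical server-side agent using Kafka as the channel
agent1.channels.channel1.type = org.apache.flume.channel.kafka.KafkaChannel
# Placeholder broker list and topic — replace with your Kafka cluster's values
agent1.channels.channel1.kafka.bootstrap.servers = kafka1:9092,kafka2:9092
agent1.channels.channel1.kafka.topic = flume-events
agent1.channels.channel1.kafka.consumer.group.id = flume-hdfs
```

With this channel, events survive an agent restart because they sit in Kafka rather than in the agent's memory, at the cost of the extra Kafka round trip.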
1. Install Flume via CDH (directly from the web UI) and integrate it with HDFS and HBase.
Modify configuration: Java Heap Size of Agent in Bytes → 1 GiB.
Create the HDFS data directory (run on master60):
hadoop fs -mkdir /flume
hadoop fs -chmod 777 /flume
2. Flume configuration for master60, node61, and node62 (configure each instance directly in the web UI — select the instance to configure; the whole cluster cannot be configured at once).
master60 configuration:
Set the agent name to: agent1
Modify the configuration file:
# Configuration file: replicate_sink1_case11.conf
# Name the components on this agent
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1

# Describe/configure the source
agent1.sources.source1.type = avro
agent1.sources.source1.bind = master60
agent1.sources.source1.port = 50000

# Describe the sink
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://master60/flume/%y-%m-%d
agent1.sinks.sink1.hdfs.filePrefix = tomcat
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.useLocalTimeStamp = true
# Round the path timestamp down to the nearest 1 hour
agent1.sinks.sink1.hdfs.round = true
agent1.sinks.sink1.hdfs.roundValue = 1
agent1.sinks.sink1.hdfs.roundUnit = hour
# Roll files at 256 MB (268435456 bytes)
agent1.sinks.sink1.hdfs.rollSize = 268435456
# 0 = never roll based on event count
agent1.sinks.sink1.hdfs.rollCount = 0
# Roll files every hour (3600 s)
agent1.sinks.sink1.hdfs.rollInterval = 3600

# Use a channel which buffers events in memory
agent1.channels.channel1.type = memory
agent1.channels.channel1.capacity = 1000
agent1.channels.channel1.transactionCapacity = 100

# Bind the source and sink to the channel
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1
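Once the agent is running, the avro source can be smoke-tested from any node with the flume-ng avro-client tool that ships with Flume; the file path below is an arbitrary example, not from this setup:

```
# Send the lines of a local file as events to the avro source
flume-ng avro-client -H master60 -p 50000 -F /tmp/test.log

# Then check that a tomcat-prefixed file appeared under today's directory
hadoop fs -ls /flume
```

If nothing shows up after the roll interval or size threshold, the agent log is the first place to look.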
node61 configuration: