Real-Time Data Collection (E-commerce)
Question 1
1. On the master node, use Flume to collect the socket data that the real-time data generator emits on port 10050 (the generator script is dj_data_gen in the /data_log directory on the master node; it is deployed locally on the master node and transmits over a socket). Store the data in a Kafka topic named order with 4 partitions, consume the order topic with Kafka's built-in console consumer, and paste a screenshot of the first 2 records under the corresponding task number in 【Release\任务D提交结果.docx】 on the client desktop.
Note: start the pre-configured Flume agent before starting the script, otherwise the script will fail to start. To start the script, enter the /data_log directory and run ./dj_data_gen (if you lack permission, run the authorization command chmod 777 /data_log/dj_data_gen first).
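Before launching the generator, it can help to confirm that the Flume Avro source is actually listening on port 10050; a quick sanity check (assuming ss is available; older systems may need netstat -tlnp instead):
ss -tlnp | grep 10050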
1. Configure the flume.conf file
vi /opt/module/flume/job/flume.conf
# Name the sources, sinks, and channels of the Flume agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Source r1: Avro source listening on port 10050
a1.sources.r1.type = avro
a1.sources.r1.bind = localhost
a1.sources.r1.port = 10050
# Sink k1: KafkaSink writing to the order topic
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = master:9092
a1.sinks.k1.kafka.topic = order
# Note: the partition count (4) is a property of the topic itself; it is set
# when the topic is created, not in the sink configuration
# Channel c1: memory channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
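Because the sink cannot set a topic's partition count, create the order topic with 4 partitions before starting the agent if it does not already exist (same form as the topic-creation commands in the industrial section below):
kafka-topics.sh --create --bootstrap-server master:9092 --replication-factor 1 --partitions 4 --topic order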
2. Start Flume
flume-ng agent -c conf/ -n a1 -f /opt/module/flume/job/flume.conf -Dflume.root.logger=INFO,console
3. Start the data generator script
cd /data_log
./dj_data_gen
4. Start a Kafka console consumer on the order topic and take the first 2 messages
kafka-console-consumer.sh --bootstrap-server master:9092 --topic order --from-beginning --max-messages 2
Question 2
Using the multiplexing (fan-out) pattern, have Flume back the data up to the HDFS directory /user/test/flumebackup at the same time it injects it into Kafka. Paste a screenshot of the command that views the first 2 records of the first file in the backup directory, together with its output, under the corresponding task number in 【Release\任务D提交结果.docx】 on the client desktop.
1. Configure flume-kafka.conf
# One source fanned out to two channels: c1 feeds Kafka, c2 feeds HDFS
a1.sources = s1
a1.channels = c1 c2
a1.sinks = k1 k2
# Source s1: Avro source on port 10050
a1.sources.s1.type = avro
a1.sources.s1.bind = localhost
a1.sources.s1.port = 10050
# The replicating selector (Flume's default) copies every event to all
# channels attached to the source
a1.sources.s1.selector.type = replicating
# Sink k1: KafkaSink writing to the order topic
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = order
a1.sinks.k1.kafka.bootstrap.servers = master:9092
# Sink k2: HDFS sink writing the backup copy as plain text
# (fileType = DataStream is required for writeFormat = Text to produce
# plain text files rather than SequenceFiles)
a1.sinks.k2.type = hdfs
a1.sinks.k2.hdfs.path = /user/test/flumebackup
a1.sinks.k2.hdfs.filePrefix = backup-
a1.sinks.k2.hdfs.fileType = DataStream
a1.sinks.k2.hdfs.writeFormat = Text
# Channels
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = file
# Bindings
a1.sources.s1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
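The HDFS sink's roll settings control when a backup file is closed and a new one started; a minimal sketch with illustrative values (these specific numbers are assumptions, not requirements of the task):
# Roll a new file every 30 seconds; disable size- and count-based rolling
a1.sinks.k2.hdfs.rollInterval = 30
a1.sinks.k2.hdfs.rollSize = 0
a1.sinks.k2.hdfs.rollCount = 0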
2. Start Flume
flume-ng agent -n a1 -c conf -f /opt/module/flume/job/flume-kafka.conf -Dflume.root.logger=INFO,console
3. Start the data generator script
cd /data_log
./dj_data_gen
4. View the first 2 records of the first file in the backup directory. The HDFS sink names its files by appending a timestamp to the backup- prefix, so list the directory first to see the actual file name:
hdfs dfs -ls /user/test/flumebackup
hdfs dfs -cat /user/test/flumebackup/backup-* | head -n 2
Real-Time Data Collection (Industrial)
Question 1
1. On the master node, use Flume to collect the data in the real-time log files under the /data_log directory (to generate the log files, execute ./make_data_file_v1 in the /data_log directory; if you lack permission, run the authorization command chmod 777 /data_log/make_data_file_v1 first). Store the data in Kafka topics named ChangeRecord, ProduceRecord, and EnvironmentData, each with 4 partitions, and paste a screenshot of the Flume configuration that collects the ChangeRecord topic under the corresponding task number in 【Release\任务D提交结果.docx】 on the client desktop.
Configure gy_to_kafka.conf
# Name the sources, sinks, and channels
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Source r1: exec source tailing the change-record log file
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /data_log/2024-01-09@15:27-changerecord.csv
# Sink k1: KafkaSink writing to the ChangeRecord topic
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = master:9092
a1.sinks.k1.kafka.topic = ChangeRecord
# Channel c1: memory channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bindings (note: a sink takes the singular "channel" property)
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
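The task involves three topics, but only the ChangeRecord configuration needs to be screenshotted. The other two follow the same pattern with extra source/sink/channel triples in the same agent; a minimal sketch, assuming the generator writes similarly named producerecord and environmentdata files (those file names are assumptions):
# Extend the agent with r2/k2/c2 for ProduceRecord (file name assumed)
a1.sources = r1 r2
a1.sinks = k1 k2
a1.channels = c1 c2
a1.sources.r2.type = exec
a1.sources.r2.command = tail -F /data_log/2024-01-09@15:27-producerecord.csv
a1.sinks.k2.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k2.kafka.bootstrap.servers = master:9092
a1.sinks.k2.kafka.topic = ProduceRecord
a1.channels.c2.type = memory
a1.sources.r2.channels = c2
a1.sinks.k2.channel = c2
# ...and an analogous r3/k3/c3 triple for EnvironmentData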
Start Flume
flume-ng agent -n a1 -c conf/ -f /opt/module/flume/job/gy_to_kafka.conf -Dflume.root.logger=INFO,console
Start the data generator script
Enter the /data_log directory and run make_data_file_v1 to produce the source data:
cd /data_log
./make_data_file_v1
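The generator embeds a timestamp in its output file names, so the exact file the exec source must tail can vary per run; list the directory to confirm it matches the path in the config (the glob pattern is an assumption based on the file name shown above):
ls /data_log/*changerecord*.csv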
In another terminal window, create the Kafka topics. Do this before Flume delivers any data, so that each topic is created with 4 partitions rather than being auto-created with the broker's default settings:
bin/kafka-topics.sh --create --bootstrap-server master:9092 --replication-factor 1 --partitions 4 --topic ChangeRecord
bin/kafka-topics.sh --create --bootstrap-server master:9092 --replication-factor 1 --partitions 4 --topic ProduceRecord
bin/kafka-topics.sh --create --bootstrap-server master:9092 --replication-factor 1 --partitions 4 --topic EnvironmentData
List the topics to verify they exist:
bin/kafka-topics.sh --list --bootstrap-server master:9092
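To confirm the partition count of a topic, describe it:
bin/kafka-topics.sh --describe --bootstrap-server master:9092 --topic ChangeRecord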