Run make_data_file_v1 under data_log
./make_data_file_v1
Create t5.conf under Flume's job directory
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f +0 /data_log/2024-01-05@23:27-producerecord.csv
a1.sources.r1.channels = c1

# Describe the sink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = ProduceRecord
a1.sinks.k1.kafka.bootstrap.servers = bigdata1:9092,bigdata2:9092,bigdata3:9092

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 100000
a1.channels.c1.transactionCapacity = 1000

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sources.r1.command = tail -f +0 /data_log/2024-01-05@23:27-producerecord.csv
Write the file name from your own data_log directory here
Start Flume first, then run the script written above on the VM
Start the producer (the Flume agent)
flume-ng agent -c conf/ -n a1 -f /opt/module/flume-1.9.0/job/t5.conf -Dflume.root.logger=INFO,console
Check the data present in the ProduceRecord topic
Create a consumer to consume the data
kafka-console-consumer.sh --bootstrap-server bigdata1:9092,bigdata2:9092,bigdata3:9092 --topic ProduceRecord --from-beginning
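If the consumer shows no data because the ProduceRecord topic does not yet exist and automatic topic creation is disabled, create it first; the partition and replication-factor values below are only illustrative, and older Kafka releases take --zookeeper instead of --bootstrap-server:
kafka-topics.sh --create --bootstrap-server bigdata1:9092,bigdata2:9092,bigdata3:9092 --topic ProduceRecord --partitions 3 --replication-factor 2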
- Because the line "==> /data_log/2024-01-05@23:27-producerecord.csv <==" appears in the stream, filter it out in the code with filter.
- An error occurred during an earlier run here, but instead of deleting the data and re-running, it is enough to also filter out "hello world", "==> /data_log/2024-01-10@09_30-producerecord.csv <==" and "==> /data_log/2024-01-05@23:27-producerecord.csv <==" in the code.
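Because the timestamp in the file name changes between runs, a prefix match on the tail header is less brittle than comparing whole strings. A minimal sketch of that filtering step (it uses the same kafkaStream as the code below; the val name cleaned is only illustrative):
// Drop any tail header line ("==> ... <==") and the stray test message, whatever the file name is.
val cleaned: DataStream[String] = kafkaStream
  .filter(line => !line.startsWith("==>"))
  .filter(line => line != "hello world")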
Write the code
Create the Kafka consumer configuration object
val properties: Properties = new Properties()
Add configuration to the Kafka consumer properties
(1) Specify the connection address
properties.setProperty("bootstrap.servers", "bigdata1:9092,bigdata2:9092,bigdata3:9092")
(2) Class used to serialize the message key, i.e. the key is serialized as a string
properties.setProperty("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
(3) Class used to deserialize the message key
properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
(4) Class used to deserialize the message value
properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
(5) If the consumer finds no initial offset for the topic, it automatically starts consuming from the earliest message
properties.setProperty("auto.offset.reset", "earliest")
Define a Flink stream execution environment, set parallelism to 1, and compute with processing time
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)
Read string data from the Kafka topic "ProduceRecord"
val kafkaStream: DataStream[String] = env.addSource(new FlinkKafkaConsumer[String]("ProduceRecord", new SimpleStringSchema(), properties))
Filter the test data and tail header lines out of kafkaStream, then split each remaining line on commas, take the 1st and 11th fields as integers, and emit them as a pair; the result is the new stream dateStream (the windowed aggregation that follows is shown after this snippet)
val dateStream = kafkaStream.filter(_ != "hello world")
  .filter(_ != "==> /data_log/2024-01-10@09_30-producerecord.csv <==")
  .filter(_ != "==> /data_log/2024-01-05@23:27-producerecord.csv <==")
  .map(line => {
    val data = line.split(",")
    (data(0).toInt, data(10).toInt)
  })
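In the full code below, the same stream is then further filtered and aggregated: only pairs whose second field equals 1 are kept, the stream is keyed by the first field, and the flag values are summed over 1-minute processing-time windows, i.e. a per-key count of matching records per minute:
  .filter(_._2 == 1)
  .keyBy(_._1)
  .timeWindow(Time.minutes(1))
  .sum(1)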
Configure the connection to the Redis database
val config: FlinkJedisPoolConfig = new FlinkJedisPoolConfig.Builder()
  .setHost("bigdata1")
  .setPort(6379)
  .build()
Create the RedisSink object and write the data to Redis
val redisSink = new RedisSink[(Int, Int)](config, new MyRedisMapper)
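MyRedisMapper is the mapper passed to the sink; its definition (taken from the full code below) writes each pair into a Redis hash named totalproduce with HSET, using the first element as the hash field and the second as the value:
class MyRedisMapper extends RedisMapper[(Int, Int)] {
  override def getCommandDescription: RedisCommandDescription =
    new RedisCommandDescription(RedisCommand.HSET, "totalproduce")
  override def getKeyFromData(t: (Int, Int)): String = t._1 + ""
  override def getValueFromData(t: (Int, Int)): String = t._2 + ""
}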
Send the data
dateStream.addSink(redisSink)
Execute the Flink program
env.execute("FlinkToKafkaToRedis")
Full code
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.streaming.connectors.redis.RedisSink
import org.apache.flink.streaming.connectors.redis.common.config.FlinkJedisPoolConfig
import org.apache.flink.streaming.connectors.redis.common.mapper.{RedisCommand, RedisCommandDescription, RedisMapper}
import java.util.Properties
object g1 {
  def main(args: Array[String]): Unit = {
    // Create the configuration object for the Kafka consumer
    val properties: Properties = new Properties()
    // Add configuration to the consumer properties: bootstrap.servers
    properties.setProperty("bootstrap.servers", "bigdata1:9092,bigdata2:9092,bigdata3:9092")
    // key/value serialization and deserialization classes
    properties.setProperty("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    properties.setProperty("auto.offset.reset", "earliest")
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)
    // Read the ProduceRecord topic from Kafka
    val kafkaStream: DataStream[String] = env.addSource(new FlinkKafkaConsumer[String]("ProduceRecord", new SimpleStringSchema(), properties))
    // Process the data with Flink operators
    val dateStream = kafkaStream.filter(_ != "hello world")
      .filter(_ != "==> /data_log/2024-01-10@09_30-producerecord.csv <==")
      .filter(_ != "==> /data_log/2024-01-05@23:27-producerecord.csv <==")
      .map(line => {
        val data = line.split(",")
        (data(0).toInt, data(10).toInt)
      })
      .filter(_._2 == 1)
      .keyBy(_._1)
      .timeWindow(Time.minutes(1))
      .sum(1)
    // Print for testing
    dateStream.print("ds")
    // Configuration for connecting to the Redis database
    val config: FlinkJedisPoolConfig = new FlinkJedisPoolConfig.Builder()
      .setHost("bigdata1")
      .setPort(6379)
      .build()
    // Create the RedisSink object and write the data to Redis
    val redisSink = new RedisSink[(Int, Int)](config, new MyRedisMapper)
    // Send the data
    dateStream.addSink(redisSink)
    // Execute the Flink program
    env.execute("FlinkToKafkaToRedis")
  }

  // As required by the task
  class MyRedisMapper extends RedisMapper[(Int, Int)] {
    // Use RedisCommand.HSET rather than RedisCommand.SET: the former writes a Redis hash, the latter a plain String key
    override def getCommandDescription: RedisCommandDescription =
      new RedisCommandDescription(RedisCommand.HSET, "totalproduce")
    override def getKeyFromData(t: (Int, Int)): String = t._1 + ""
    override def getValueFromData(t: (Int, Int)): String = t._2 + ""
  }
}
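To confirm that the results reach Redis, inspect the totalproduce hash on bigdata1 (assuming redis-cli is available on that host):
redis-cli -h bigdata1 -p 6379 HGETALL totalproduce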