Data Collection Example 1

This article shows how to use Flume to read data from a CSV file, filter out specific lines, transform the data with Flink stream processing, filter on specific values, and finally write the processed data to Kafka and store the results in Redis. It covers the Flume configuration, the Kafka producer and consumer, and the use of Flink.

Run make_data_file_v1 under data_log:

./make_data_file_v1


Create t5.conf under Flume's job directory:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f +0 /data_log/2024-01-05@23:27-producerecord.csv
a1.sources.r1.channels = c1

# Describe the sink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = ProduceRecord
a1.sinks.k1.kafka.bootstrap.servers = bigdata1:9092,bigdata2:9092,bigdata3:9092
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 100000
a1.channels.c1.transactionCapacity = 1000

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

a1.sources.r1.command = tail -f +0 /data_log/2024-01-05@23:27-producerecord.csv

Put the name of your own file under /data_log here.
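
If the exact file name is unclear, list the generated files first (the path is assumed to match the config above):

ls /data_log/*-producerecord.csv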

Start Flume first, then run the data-generation script from earlier on the VM.

Start the Flume agent (it acts as the Kafka producer):

flume-ng agent -c conf/ -n a1 -f /opt/module/flume-1.9.0/job/t5.conf -Dflume.root.logger=INFO,console

Check the data that has landed in the ProduceRecord topic.

Create a consumer to consume the data:

kafka-console-consumer.sh --bootstrap-server bigdata1:9092,bigdata2:9092,bigdata3:9092 --topic ProduceRecord --from-beginning

  • Because tail prints the header line "==> /data_log/2024-01-05@23:27-producerecord.csv <==" (it prints these headers when it is given more than one file to follow), that line also ends up in the topic, so it is removed with filter in the code.
  • An earlier run produced some bad records, but instead of deleting the topic and rerunning, it is enough to also filter out "hello world", "==> /data_log/2024-01-10@09_30-producerecord.csv <==" and "==> /data_log/2024-01-05@23:27-producerecord.csv <==" (see the sketch after this list).
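
A more general way to drop these junk lines, instead of matching each one literally, is to keep only lines that look like CSV records. This is a minimal sketch against the kafkaStream defined further below; the startsWith("==>") and comma checks are assumptions about what the junk lines look like:

// Keep only non-empty lines that are not tail's "==> file <==" headers
// and that contain at least one comma, i.e. that look like CSV records.
val cleanStream: DataStream[String] = kafkaStream
  .filter(line => line.nonEmpty && !line.startsWith("==>") && line.contains(","))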

Writing the code

Create the Kafka consumer configuration object:

val properties:Properties=new Properties

Add configuration to the Kafka consumer properties:

        (1) Specify the broker addresses.

properties.setProperty("bootstrap.servers", "bigdata1:9092,bigdata2:9092,bigdata3:9092")

        (2) The serializer class for message keys; keys will be serialized as strings.

properties.setProperty("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")

        (3) The deserializer class for message keys.

properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

        (4) The deserializer class for message values.

properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

        (5) If the consumer finds no committed offset for the topic, it automatically starts consuming from the earliest messages.

properties.setProperty("auto.offset.reset", "earliest")

Define a Flink stream execution environment, set the parallelism to 1, and compute on processing time:

val env:StreamExecutionEnvironment= StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)

Read string data from the Kafka topic "ProduceRecord":

val kafkaStream: DataStream[String] = env.addSource(new FlinkKafkaConsumer[String]("ProduceRecord", new SimpleStringSchema(), properties))

Split each line of kafkaStream into an array, take the 1st and the 10th elements, build an integer pair from them, and use these pairs as the elements of a new stream dateStream:

val dateStream = kafkaStream.filter(_!="hello world")
  .filter(_!="==> /data_log/2024-01-10@09_30-producerecord.csv <==")
  .filter(_!="==> /data_log/2024-01-05@23:27-producerecord.csv <==")
  .map(line => {
    val data = line.split(",")
    (data(0).toInt, data(9).toInt)
  })

Connect to the Redis database:

val config: FlinkJedisPoolConfig = new FlinkJedisPoolConfig.Builder()
  .setHost("bigdata1")
  .setPort(6379)
  .build()

Create the RedisSink object (its MyRedisMapper is defined in the full code below), which writes the data to Redis:

val redisSink = new RedisSink[(Int, Int)](config, new MyRedisMapper)

Send the data:

dateStream.addSink(redisSink)

Execute the Flink program:

env.execute("FlinkToKafkaToRedis")
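
Once the job is running, the per-machine totals can be checked directly in Redis. A quick verification, assuming redis-cli is available and using the hash key totalproduce that MyRedisMapper writes to in the full code below:

redis-cli -h bigdata1 -p 6379 HGETALL totalproduce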

Full code

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.streaming.connectors.redis.RedisSink
import org.apache.flink.streaming.connectors.redis.common.config.FlinkJedisPoolConfig
import org.apache.flink.streaming.connectors.redis.common.mapper.{RedisCommand, RedisCommandDescription, RedisMapper}
import java.util.Properties

object g1 {
  def main(args: Array[String]): Unit = {
    //Create the configuration object for the Flink Kafka consumer
    val properties:Properties=new Properties
    //Add configuration to the consumer properties: bootstrap.servers
    properties.setProperty("bootstrap.servers", "bigdata1:9092,bigdata2:9092,bigdata3:9092")
    //key/value serialization and deserialization
    properties.setProperty("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    properties.setProperty("auto.offset.reset", "earliest")

    val env:StreamExecutionEnvironment= StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)

    //Read the data from Kafka with Flink
    val kafkaStream: DataStream[String] = env.addSource(new FlinkKafkaConsumer[String]("ProduceRecord", new SimpleStringSchema(), properties))

    //Process the data with Flink operators
    val dateStream = kafkaStream.filter(_!="hello world")
      .filter(_!="==> /data_log/2024-01-10@09_30-producerecord.csv <==")
      .filter(_!="==> /data_log/2024-01-05@23:27-producerecord.csv <==")
      .map(line => {
        val data = line.split(",")
        (data(0).toInt, data(10).toInt)
      })
      .filter(_._2 == 1)
      .keyBy(_._1)
      .timeWindow(Time.minutes(1))
      .sum(1)
    //Print for testing
    dateStream.print("ds")
    //Configuration for connecting to the Redis database
    val config: FlinkJedisPoolConfig = new FlinkJedisPoolConfig.Builder()
      .setHost("bigdata1")
      .setPort(6379)
      .build()

    // Create the RedisSink object and write the data to Redis
    val redisSink = new RedisSink[(Int, Int)](config, new MyRedisMapper)

    // Send the data
    dateStream.addSink(redisSink)

    //Execute the Flink program
    env.execute("FlinkToKafkaToRedis")

  }

  // As required by the task
  class MyRedisMapper extends RedisMapper[(Int, Int)] {
    //Use RedisCommand.HSET here rather than RedisCommand.SET: the former writes into a Redis hash, the latter writes plain Redis string keys
    override def getCommandDescription: RedisCommandDescription = new RedisCommandDescription(RedisCommand.HSET,
      "totalproduce")

    override def getKeyFromData(t: (Int, Int)): String = t._1 + ""

    override def getValueFromData(t: (Int, Int)): String = t._2 + ""

  }
}
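
To build the program, the Flink Kafka connector and the Bahir Redis connector must be on the classpath. A minimal sbt sketch, assuming Flink 1.10.1 on Scala 2.11 (the exact versions are an assumption; match them to what is installed on the cluster):

libraryDependencies ++= Seq(
  "org.apache.flink" %% "flink-streaming-scala" % "1.10.1" % "provided",
  "org.apache.flink" %% "flink-connector-kafka" % "1.10.1",
  "org.apache.bahir" %% "flink-connector-redis" % "1.0"
)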
