Spark Streaming + Flume Integration

While practicing Spark Streaming + Flume integration I spent a long time stuck in pitfalls and ran into all kinds of errors.

Environment: standalone Flume 1.7.0 on Linux, standalone Spark 2.2.0 on Windows.

The main takeaways are as follows:

1. When the source type is avro, a single agent is not enough: another agent has to push data into the avro source (a minimal two-agent sketch follows after this list).

2. Spark kept failing to receive any data. The main errors were:

java.net.BindException: Cannot assign requested address: bind

org.apache.flume.FlumeException: NettyAvroRpcClient { host: 10.111.121.111, port: 9999 }: RPC connection error

Cause: in the Flume configuration, the sink's output host and port must be those of the machine where the Spark program runs, and on the Spark side the receiver must also listen on that same machine's IP + port. A machine cannot bind a port on another host; I kept listening on the Linux machine's IP + port, which is why the errors never went away.
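To illustrate point 1: an avro source is purely a receiver, so a second agent must push events into it through an avro sink. A minimal sketch of such a chain follows; the agent name a2, the exec source, the log path, and port 4545 are made up for illustration only:

# Hypothetical upstream agent a2: tails a local file and ships it to agent1's avro source
a2.sources = s1
a2.channels = c1
a2.sinks = k1
a2.sources.s1.type = exec
a2.sources.s1.command = tail -F /usr/local/test.log
a2.sources.s1.channels = c1
a2.channels.c1.type = memory
a2.sinks.k1.type = avro
a2.sinks.k1.channel = c1
# Must point at the host/port where agent1's avro source listens
a2.sinks.k1.hostname = <agent1-host>
a2.sinks.k1.port = 4545

# On agent1, the matching avro source fragment would look like:
agent1.sources.sc1.type = avro
agent1.sources.sc1.bind = 0.0.0.0
agent1.sources.sc1.port = 4545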

 

The final configuration is as follows:

Flume configuration (the source type here is spooldir; if you use avro instead, you must start another agent to push data into this one. The sink's output must be the IP of the host running Spark plus a port of your choosing):

# Register a source named sc1 with agent instance agent1
agent1.sources = sc1
# Register a channel named ch1 with agent instance agent1
agent1.channels = ch1
# Register a sink named sk1 with agent instance agent1
agent1.sinks = sk1
# Configuration of source sc1
# Source type: spooldir
agent1.sources.sc1.type = spooldir
# Buffer collected data through channel ch1
agent1.sources.sc1.channels = ch1
# Directory to spool
agent1.sources.sc1.spoolDir = /usr/local/test
# Configuration of channel ch1
# Channel type: memory
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 1000
agent1.channels.ch1.transactionCapacity = 100
# Configuration of sink sk1
# Sink type: avro
agent1.sinks.sk1.type = avro
# Channel whose data this sink drains
agent1.sinks.sk1.channel = ch1
# Host to send to (the host running the Spark program)
agent1.sinks.sk1.hostname = 10.111.121.111
# Port to send to
agent1.sinks.sk1.port = 6666
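Assuming the configuration above is saved as conf/spool-to-avro.conf (a file name picked here for illustration), the agent can be started with the standard flume-ng script:

bin/flume-ng agent --conf conf --conf-file conf/spool-to-avro.conf --name agent1 -Dflume.root.logger=INFO,console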

Spark program (based on the official example):

import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}

val spark = SparkSession.builder().master("local[*]").appName("flumeSpark2").getOrCreate()
val sc = spark.sparkContext
val ssc = new StreamingContext(sc, Seconds(5))
// Listen on port 6666 of the local machine -- the same host/port the Flume avro sink points at
val flumeStream: ReceiverInputDStream[SparkFlumeEvent] = FlumeUtils.createStream(ssc, "0.0.0.0", 6666)
flumeStream.map(e => "Event: header: " + e.event.getHeaders.toString + " body: " + new String(e.event.getBody.array)).print()
ssc.start()
ssc.awaitTermination()
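For completeness, the push-based FlumeUtils.createStream receiver comes from the separate spark-streaming-flume module, so it has to be on the classpath; a minimal sbt dependency block, assuming a Scala 2.11 build against Spark 2.2.0, might look like:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "2.2.0",
  "org.apache.spark" %% "spark-streaming" % "2.2.0",
  "org.apache.spark" %% "spark-streaming-flume" % "2.2.0"
)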

Flume monitoring MySQL + Spark Streaming (fetch the latest rows from MySQL at a configured interval, send them to the port the Spark program listens on, and process them there):

agent1.sources = r1
agent1.sinks = k1
agent1.channels = c1
# Describe/configure the source
agent1.sources.r1.type = org.keedio.flume.source.SQLSource
agent1.sources.r1.hibernate.connection.url = jdbc:mysql://10.111.121.236:3306/test
# Hibernate database connection properties
agent1.sources.r1.hibernate.connection.user = root
agent1.sources.r1.hibernate.connection.password = cetcAdmin123
agent1.sources.r1.hibernate.connection.autocommit = true
agent1.sources.r1.hibernate.dialect = org.hibernate.dialect.MySQL5Dialect
agent1.sources.r1.hibernate.connection.driver_class = com.mysql.jdbc.Driver
agent1.sources.r1.run.query.delay=1000
#agent1.sources.r1.table = test
#agent1.sources.r1.columns.to.select = *
#agent1.sources.r1.incremental.column.name = id
agent1.sources.r1.incremental.value = 1
agent1.sources.r1.status.file.path = /usr/local/flume_log
agent1.sources.r1.status.file.name = sqlSource.status
# Custom query
agent1.sources.r1.start.from = 19700000000000
agent1.sources.r1.custom.query = SELECT DATE_FORMAT(createTime, '%Y%m%d%H%i%s') as id_new,t.* FROM test t WHERE DATE_FORMAT(createTime, '%Y%m%d%H%i%s') > $@$ ORDER BY DATE_FORMAT(createTime, '%Y%m%d%H%i%s') ASC
agent1.sources.r1.batch.size = 1000
agent1.sources.r1.max.rows = 1000
agent1.sources.r1.hibernate.connection.provider_class = org.hibernate.connection.C3P0ConnectionProvider
agent1.sources.r1.hibernate.c3p0.min_size=1
agent1.sources.r1.hibernate.c3p0.max_size=10

# Describe the sink
agent1.sinks.k1.type = avro
# Channel whose data this sink drains (bound at the bottom)
#agent1.sinks.k1.channel = c1
# Host to send to (the host running the Spark program)
agent1.sinks.k1.hostname = 10.111.121.111
# Port to send to
agent1.sinks.k1.port = 6666

# Use a channel which buffers events in memory
agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 1000
agent1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
agent1.sources.r1.channels = c1
agent1.sinks.k1.channel = c1
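On the Spark side, the keedio SQL source delivers each row in the event body as a delimited string (comma-separated by default, to the best of my knowledge). A minimal sketch of splitting those rows into columns, reusing flumeStream from the program above; the (id_new, remaining columns) layout is only an assumption based on the custom query:

// Sketch: parse each event body as one comma-separated row emitted by the SQL source.
// Column order (id_new first, then the columns of table test) is assumed from the query above.
flumeStream
  .map(e => new String(e.event.getBody.array).trim)
  .filter(_.nonEmpty)
  .map(_.split(",").map(_.trim))
  .map(cols => (cols(0), cols.drop(1).mkString("|")))
  .print()

In a real program this transformation has to be declared before ssc.start() is called.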