While practicing the Spark Streaming + Flume integration I spent a long time stuck on pitfalls and ran into all kinds of errors.
Environment: standalone Flume 1.7.0 on Linux, standalone Spark 2.2.0 on Windows.
The main takeaways are summarized below:
1. When the source type is avro, a single agent is not enough: an avro source only receives events that another agent's avro sink pushes to it, so you need at least two agents.
2. The Spark side never received anything; the main errors were:
java.net.BindException: Cannot assign requested address: bind
org.apache.flume.FlumeException: NettyAvroRpcClient { host: 10.111.121.111, port: 9999 }: RPC connection error
Cause: in the Flume configuration, the sink must be given the host and port of the machine where the Spark program runs, and on the Spark side the receiver must also bind to that same machine's IP + port. A host cannot bind a listener to another host's address; I kept listening on the Linux machine's IP + port, which is exactly what produced the errors above (a minimal reproduction follows).
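To see the mistake in isolation, here is a minimal sketch (the IP below is a made-up stand-in for the remote Linux host's address; substitute your own): a JVM process can only bind server sockets to addresses owned by the machine it runs on, which is what the Spark receiver tried and failed to do.

import java.net.{BindException, InetSocketAddress, ServerSocket}

// Sketch: binding to an IP that does not belong to this machine fails with
// "java.net.BindException: Cannot assign requested address", the same error
// the Spark Flume receiver reported. 192.168.0.99 is a hypothetical remote IP.
object BindCheck extends App {
  val sock = new ServerSocket() // unbound socket
  try {
    sock.bind(new InetSocketAddress("192.168.0.99", 6666)) // not a local address
    println("bound ok")
  } catch {
    case e: BindException => println("bind failed: " + e.getMessage)
  } finally {
    sock.close()
  }
}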
The final configuration is as follows:
Flume configuration (the source type here is spooldir; if you use avro instead, you must start another agent that pushes data into this one, as sketched after the config below; the sink must be given the IP of the host running the Spark program plus a port of your choosing):
# Register a source named sc1 with the agent1 instance
agent1.sources = sc1
# Register a channel named ch1 with the agent1 instance
agent1.channels = ch1
# Register a sink named sk1 with the agent1 instance
agent1.sinks = sk1

# Configuration of source sc1
# Source type: spooldir
agent1.sources.sc1.type = spooldir
# Buffer the collected data through channel ch1
agent1.sources.sc1.channels = ch1
# Directory to collect files from
agent1.sources.sc1.spoolDir = /usr/local/test

# Configuration of channel ch1
# Channel type: memory
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 1000
agent1.channels.ch1.transactionCapacity = 100

# Configuration of the sink
# Sink type: avro
agent1.sinks.sk1.type = avro
# Which channel to drain
agent1.sinks.sk1.channel = ch1
# Target host (must be the host running the Spark program)
agent1.sinks.sk1.hostname = 10.111.121.111
# Target port
agent1.sinks.sk1.port = 6666
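On point 1 above: an avro source receives nothing by itself; it only accepts events pushed over Avro RPC by some upstream agent's avro sink, which is why a second agent is required. A minimal sketch of what the receiving side would look like (the agent name, bind address, port, and logger sink here are all made up for illustration):

agent2.sources = av1
agent2.channels = ch2
agent2.sinks = sk2
# avro source: waits for events pushed by another agent's avro sink
agent2.sources.av1.type = avro
agent2.sources.av1.bind = 0.0.0.0
agent2.sources.av1.port = 7777
agent2.sources.av1.channels = ch2
agent2.channels.ch2.type = memory
# logger sink just prints received events, for verification
agent2.sinks.sk2.type = logger
agent2.sinks.sk2.channel = ch2

Either agent is started the usual way, e.g. (config path assumed):

bin/flume-ng agent --conf conf --conf-file conf/agent1.conf --name agent1 -Dflume.root.logger=INFO,console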
Spark program (the official example):
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}

val spark = SparkSession.builder().master("local[*]").appName("flumeSpark2").getOrCreate()
val sc = spark.sparkContext
val ssc = new StreamingContext(sc, Seconds(5))
// Listen on all local interfaces; the port must match the Flume sink's port (6666)
val flumeStream: ReceiverInputDStream[SparkFlumeEvent] =
  FlumeUtils.createStream(ssc, "0.0.0.0", 6666)
flumeStream.map(e => "Event: header: " + e.event.get(0).toString + " body: " + new String(e.event.getBody.array)).print()
ssc.start()
ssc.awaitTermination()
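To build this, the spark-streaming-flume connector (along with the usual Spark artifacts) has to be on the classpath; in sbt, assuming a Scala 2.11 build to match Spark 2.2.0, that would look something like:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"             % "2.2.0",
  "org.apache.spark" %% "spark-streaming"       % "2.2.0",
  "org.apache.spark" %% "spark-streaming-flume" % "2.2.0"
)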
Flume monitoring MySQL + Spark Streaming (fetch the newest rows from MySQL at a set interval, push them to the port opened by the Spark program, and process the data there; a sketch of parsing the rows on the Spark side follows the config):
agent1.sources = r1
agent1.sinks = k1
agent1.channels = c1

# Describe/configure the source (keedio flume-ng-sql-source)
agent1.sources.r1.type = org.keedio.flume.source.SQLSource
agent1.sources.r1.hibernate.connection.url = jdbc:mysql://10.111.121.236:3306/test

# Hibernate database connection properties
agent1.sources.r1.hibernate.connection.user = root
agent1.sources.r1.hibernate.connection.password = cetcAdmin123
agent1.sources.r1.hibernate.connection.autocommit = true
agent1.sources.r1.hibernate.dialect = org.hibernate.dialect.MySQL5Dialect
agent1.sources.r1.hibernate.connection.driver_class = com.mysql.jdbc.Driver
agent1.sources.r1.run.query.delay = 1000
#agent1.sources.r1.table = test
#agent1.sources.r1.columns.to.select = *
#agent1.sources.r1.incremental.column.name = id
agent1.sources.r1.incremental.value = 1
agent1.sources.r1.status.file.path = /usr/local/flume_log
agent1.sources.r1.status.file.name = sqlSource.status

# Custom query
agent1.sources.r1.start.from = 19700000000000
agent1.sources.r1.custom.query = SELECT DATE_FORMAT(createTime, '%Y%m%d%H%i%s') as id_new, t.* FROM test t WHERE DATE_FORMAT(createTime, '%Y%m%d%H%i%s') > $@$ ORDER BY DATE_FORMAT(createTime, '%Y%m%d%H%i%s') ASC
agent1.sources.r1.batch.size = 1000
agent1.sources.r1.max.rows = 1000
agent1.sources.r1.hibernate.connection.provider_class = org.hibernate.connection.C3P0ConnectionProvider
agent1.sources.r1.hibernate.c3p0.min_size = 1
agent1.sources.r1.hibernate.c3p0.max_size = 10

# Describe the sink
agent1.sinks.k1.type = avro
# Target host (must be the host running the Spark program)
agent1.sinks.k1.hostname = 10.111.121.111
# Target port
agent1.sinks.k1.port = 6666

# Use a channel which buffers events in memory
agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 1000
agent1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
agent1.sources.r1.channels = c1
agent1.sinks.k1.channel = c1
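On the Spark side the stream is consumed just like in the program above. As far as I can tell from the flume-ng-sql-source docs, each row is written into the event body as delimited text (comma-separated by default; verify against your version), so reusing flumeStream from the program above, a hedged sketch of splitting rows back into columns:

// Hedged sketch: flume-ng-sql-source serializes each row as delimited text in
// the event body. Column order follows custom.query above: id_new first, then
// the columns of table t.
val rows = flumeStream.map(e => new String(e.event.getBody.array))
rows
  .map(_.split(",").map(_.trim.stripPrefix("\"").stripSuffix("\"")))
  .map(cols => (cols.head, cols.tail.mkString(" | "))) // (id_new, rest of row)
  .print()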