Spark Streaming (7): Hands-On Example of Spark Streaming Polling Data from Flume, with Source Code Analysis

This installment covers:
1. Spark Streaming polling from Flume: hands-on example
2. Spark Streaming polling from Flume: source code analysis

FlumeConnection: the Flume-side entity representing a distributed connection (it will come up again in the source code analysis below).

1. To have Spark Streaming actively pull data from Flume, first configure the Flume configuration file (a sample is sketched below).
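The original post shows this configuration only as a screenshot. Below is a minimal sketch of what such a flume-conf.properties can look like, assuming a spooling-directory source watching the TestDir used in step 6 and a SparkSink listening on Master:9999 to match the code in step 2. The agent name agent1 matches the flume-ng command in step 5; the remaining names (source1, channel1, sink1) and capacities are illustrative.

agent1.sources = source1
agent1.channels = channel1
agent1.sinks = sink1

# Spooling-directory source: files dropped into TestDir are ingested as events
agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = /usr/local/flume/apache-flume-1.6.0-bin/tmp/TestDir
agent1.sources.source1.channels = channel1

# In-memory channel between source and sink
agent1.channels.channel1.type = memory
agent1.channels.channel1.capacity = 10000
agent1.channels.channel1.transactionCapacity = 1000

# SparkSink buffers events until Spark Streaming polls them
agent1.sinks.sink1.type = org.apache.spark.streaming.flume.sink.SparkSink
agent1.sinks.sink1.hostname = Master
agent1.sinks.sink1.port = 9999
agent1.sinks.sink1.channel = channel1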
 
2. Write the source code, SparkStreamingPullFlume.java:
package com.dt.spark.SparkApps.SparkStreaming;

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.flume.FlumeUtils;
import org.apache.spark.streaming.flume.SparkFlumeEvent;

import scala.Tuple2;

public class SparkStreamingPullFlume {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local[4]").setAppName("FlumePushDate2SparkStreaming");
        JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(30));

        // Pull data from the Flume SparkSink listening on Master:9999.
        JavaReceiverInputDStream<SparkFlumeEvent> lines =
                FlumeUtils.createPollingStream(jsc, "Master", 9999);

        // Decode each event body and split it into words.
        JavaDStream<String> words = lines.flatMap(new FlatMapFunction<SparkFlumeEvent, String>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Iterable<String> call(SparkFlumeEvent event) throws Exception {
                String line = new String(event.event().getBody().array());
                return Arrays.asList(line.split(" "));
            }
        });

        // Map each word to a (word, 1) pair.
        JavaPairDStream<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Tuple2<String, Integer> call(String word) throws Exception {
                return new Tuple2<String, Integer>(word, 1);
            }
        });

        // Sum the counts for each word within the batch.
        JavaPairDStream<String, Integer> wordsCount = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Integer call(Integer v1, Integer v2) throws Exception {
                return v1 + v2;
            }
        });

        wordsCount.print();

        jsc.start();
        jsc.awaitTermination();
        jsc.close();
    }
}
3. Install three jars into Flume's lib directory:
commons-lang3-3.3.2.jar, scala-library-2.10.4.jar, spark-streaming-flume-sink_2.10-1.6.1.jar
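Note that these jars go on the Flume agent side. The driver program additionally needs the client library spark-streaming-flume_2.10 on its classpath; a minimal sketch of the Maven coordinates, assuming the Eclipse project is Maven-based:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming-flume_2.10</artifactId>
  <version>1.6.1</version>
</dependency>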


4. Run SparkStreamingPullFlume.java in Eclipse.
 

5. Run flume-ng:
./flume-ng agent -n agent1 -c conf -f /usr/local/flume/apache-flume-1.6.0-bin/conf/flume-conf.properties -Dflume.root.logger=DEBUG,console
6. Copy hello.txt into /usr/local/flume/apache-flume-1.6.0-bin/tmp/TestDir/. Its contents:
Hello Spark
Hello Hadoop
Hello Kafka
Hello HDFS
7. Check the output. For the hello.txt above, the 30-second batch should print word counts such as (Hello,4), (Spark,1), (Hadoop,1), (Kafka,1), (HDFS,1).
 
II. Source Code Analysis
1. Creating the stream: FlumeUtils.createPollingStream
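All of the createPollingStream overloads funnel into one method that simply constructs the polling input DStream; a paraphrased sketch from FlumeUtils.scala in Spark 1.6:

  def createPollingStream(
      ssc: StreamingContext,
      addresses: Seq[InetSocketAddress],
      storageLevel: StorageLevel,
      maxBatchSize: Int,
      parallelism: Int
    ): ReceiverInputDStream[SparkFlumeEvent] = {
    // The DStream only describes the stream; the real work happens in the
    // receiver it creates on a worker node.
    new FlumePollingInputDStream[SparkFlumeEvent](
      ssc, addresses, maxBatchSize, parallelism, storageLevel)
  }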
 
 
The returned FlumePollingInputDStream extends ReceiverInputDStream and overrides its getReceiver method, which creates a FlumePollingReceiver.
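A paraphrased sketch of that override, from FlumePollingInputDStream.scala in Spark 1.6:

  override def getReceiver(): Receiver[SparkFlumeEvent] = {
    new FlumePollingReceiver(addresses, maxBatchSize, parallelism, storageLevel)
  }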

Lazy initialization and factory methods are used to create the worker threads and the NIO client sockets.
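Paraphrased from the body of FlumePollingReceiver in the Spark 1.6 source (exact names may differ slightly):

  // Thread pool backing the Netty channel factory
  lazy val channelFactoryExecutor = Executors.newCachedThreadPool(
    new ThreadFactoryBuilder().setDaemon(true)
      .setNameFormat("Flume Receiver Channel Thread - %d").build())

  // Netty NIO socket channel factory used to open connections to each SparkSink
  lazy val channelFactory =
    new NioClientSocketChannelFactory(channelFactoryExecutor, channelFactoryExecutor)

  // Fixed-size pool that runs the batch-fetching worker threads
  lazy val receiverExecutor = Executors.newFixedThreadPool(
    parallelism,
    new ThreadFactoryBuilder().setDaemon(true)
      .setNameFormat("Flume Receiver Thread - %d").build())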
 
The worker threads pull data from the Flume sink; in essence, this fetching goes through a message queue.

Look at the poll method called in receiver.getConnections.poll() inside the worker's run() method:

poll dequeues from a message queue. Stepping into dequeue, something interesting shows up: there is a matching enqueue that delivers into this queue, so the next step is to search for the places where enqueue is called.

That search leads to FlumePollingInputDStream.scala, where the queue is initialized as a LinkedBlockingQueue[FlumeConnection]().
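For reference, FlumeConnection itself is just a small wrapper pairing the Netty transceiver with the Avro callback client (paraphrased from the Spark 1.6 source):

  private[flume] class FlumeConnection(
      val transceiver: NettyTransceiver,
      val client: SparkFlumeProtocol.Callback)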

The queue is populated in the onStart() method, sketched below.
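A paraphrased sketch of FlumePollingReceiver.onStart() from the Spark 1.6 source:

  override def onStart(): Unit = {
    // One FlumeConnection (transceiver + Avro callback client) is created per
    // configured SparkSink address and enqueued; this is the enqueue side of
    // the LinkedBlockingQueue seen above.
    addresses.foreach { host =>
      val transceiver = new NettyTransceiver(
        new InetSocketAddress(host.host, host.port), channelFactory)
      val client = SpecificRequestor.getClient(
        classOf[SparkFlumeProtocol.Callback], transceiver)
      connections.add(new FlumeConnection(transceiver, client))
    }
    // Start `parallelism` FlumeBatchFetcher workers, each of which polls
    // connections from the queue and fetches event batches.
    for (i <- 0 until parallelism) {
      receiverExecutor.submit(new FlumeBatchFetcher(this))
    }
  }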
 
Inside the worker thread, every failed batch ends with a negative acknowledgement. The error handling, lightly reconstructed from the Spark 1.6 source of FlumeBatchFetcher, looks like this:

case interrupted: InterruptedException =>
  if (!receiver.isStopped()) {
    logWarning("Interrupted while receiving data from Flume", interrupted)
    sendNack(batchReceived, client, seq)
  }
case exception: Exception =>
  logWarning("Error while receiving data from Flume", exception)
  sendNack(batchReceived, client, seq)
My personal understanding of sendNack and sendAck: much like the TCP three-way handshake, they are a mutual agreement within the transfer protocol about whether the data actually made it across.
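A paraphrased sketch of the two helpers in FlumeBatchFetcher (Spark 1.6; logging elided):

  private def sendAck(client: SparkFlumeProtocol.Callback, seq: CharSequence): Unit = {
    // Tell the SparkSink the batch was stored, so it can commit its transaction.
    client.ack(seq)
  }

  private def sendNack(batchReceived: Boolean, client: SparkFlumeProtocol.Callback,
      seq: CharSequence): Unit = {
    // Only nack if a batch was actually received; the sink then rolls back its
    // transaction and the events stay in the channel for redelivery.
    if (batchReceived) {
      client.nack(seq)
    }
  }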
Looking back at the poll interface: it really just fetches the next item to process from the message queue.

poll: the receiver listens on the data connection port; incoming data is placed into the receiving thread's main message queue and then dispatched to the worker threads' queues, from which the data is extracted.
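A condensed sketch of that flow, paraphrased from FlumeBatchFetcher.run() in Spark 1.6:

  while (!receiver.isStopped()) {
    // Dequeue a connection to one of the SparkSinks (the poll seen above).
    val connection = receiver.getConnections.poll()
    val client = connection.client
    var batchReceived = false
    var seq: CharSequence = null
    try {
      // Ask the sink for a batch of up to maxBatchSize events. If the batch
      // is valid, the events are stored into Spark via receiver.store(...)
      // and acknowledged with sendAck(client, seq); on any error,
      // sendNack(batchReceived, client, seq) is sent as shown earlier.
      val eventBatch = client.getEventBatch(receiver.getMaxBatchSize)
      // ... store and ack/nack ...
    } finally {
      // Return the connection so other workers can reuse it.
      receiver.getConnections.add(connection)
    }
  }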

Chief editor: Wang Jialin (王家林)

Material source: DT_大数据梦工厂 (IMF Legendary Action confidential course)

For more exclusive content, follow the WeChat public account: DT_Spark

If you are interested in Big Data and Spark, you can listen to the free public Spark class taught by Wang Jialin every evening at 20:00, YY room number: 68917580


From the ITPUB blog: http://blog.itpub.net/31133864/viewspace-2091114/. Please credit the source when reposting; otherwise legal responsibility may be pursued.

