Spark Streaming with Flume as a Data Source: A Hands-on Case

This article shares a hands-on case of Spark Streaming pulling data from Flume: configuring the Flume agent, writing the SparkStreamingPullFlume.java source, and installing the required jars into Flume's lib directory. It walks through receiving, processing, and printing the data, and analyzes the key parts of the underlying source code, such as how FlumePollingInputDStream works and the message transport protocol.

In this installment:

1. Spark Streaming polling from Flume: hands-on

2. Spark Streaming polling from Flume: source code

FlumeConnection: the entity representing a distributed connection to Flume

I. Hands-on

1. To have Spark Streaming actively pull data from Flume, first set up the Flume configuration file.
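The original post shows the configuration as a screenshot. Below is a minimal sketch of what such a pull-mode configuration could look like, assuming a spooling-directory source watching the TestDir directory used in step 6 and the SparkSink host/port used in the Java code; the agent/channel names and capacity values are illustrative assumptions:

# agent1: spooldir source -> memory channel -> SparkSink (pull mode)
agent1.sources = source1
agent1.channels = channel1
agent1.sinks = sink1

# Watch a directory for new files (matches step 6 below)
agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = /usr/local/flume/apache-flume-1.6.0-bin/tmp/TestDir
agent1.sources.source1.channels = channel1

# Buffer events in memory (capacities are illustrative)
agent1.channels.channel1.type = memory
agent1.channels.channel1.capacity = 10000
agent1.channels.channel1.transactionCapacity = 1000

# SparkSink listens here; Spark Streaming polls it (matches createPollingStream("Master", 9999))
agent1.sinks.sink1.type = org.apache.spark.streaming.flume.sink.SparkSink
agent1.sinks.sink1.hostname = Master
agent1.sinks.sink1.port = 9999
agent1.sinks.sink1.channel = channel1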

2. Write the source file SparkStreamingPullFlume.java:

package com.dt.spark.SparkApps.SparkStreaming;

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.flume.FlumeUtils;
import org.apache.spark.streaming.flume.SparkFlumeEvent;

import scala.Tuple2;

public class SparkStreamingPullFlume {

    public static void main(String[] args) {
        final SparkConf conf = new SparkConf().setMaster("local[4]").setAppName("FlumePushDate2SparkStreaming");
        // 30-second batch interval
        JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(30));

        // Pull events from the Flume SparkSink listening on Master:9999
        JavaReceiverInputDStream<SparkFlumeEvent> lines = FlumeUtils.createPollingStream(jsc, "Master", 9999);

        // Split each Flume event body into words
        JavaDStream<String> words = lines.flatMap(new FlatMapFunction<SparkFlumeEvent, String>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Iterable<String> call(SparkFlumeEvent event) throws Exception {
                String line = new String(event.event().getBody().array());
                return Arrays.asList(line.split(" "));
            }
        });

        // Map each word to a (word, 1) pair
        JavaPairDStream<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Tuple2<String, Integer> call(String word) throws Exception {
                return new Tuple2<String, Integer>(word, 1);
            }
        });

        // Sum the counts for each word within the batch
        JavaPairDStream<String, Integer> wordsCount = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Integer call(Integer v1, Integer v2) throws Exception {
                return v1 + v2;
            }
        });

        wordsCount.print();

        jsc.start();
        jsc.awaitTermination();
        jsc.close();
    }
}

3. Install three jars into Flume's lib directory: commons-lang3-3.3.2.jar, scala-library-2.10.4.jar, and spark-streaming-flume-sink_2.10-1.6.1.jar.
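Assuming Flume lives under /usr/local/flume/apache-flume-1.6.0-bin as elsewhere in this article, copying them might look like:

cp commons-lang3-3.3.2.jar scala-library-2.10.4.jar spark-streaming-flume-sink_2.10-1.6.1.jar /usr/local/flume/apache-flume-1.6.0-bin/lib/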

 

4. Run SparkStreamingPullFlume.java from Eclipse.


5. Start the Flume agent:

./flume-ng agent -n agent1 -c conf -f /usr/local/flume/apache-flume-1.6.0-bin/conf/flume-conf.properties -Dflume.root.logger=DEBUG,console

6. Copy hello.txt into /usr/local/flume/apache-flume-1.6.0-bin/tmp/TestDir/. Its contents:

Hello Spark

Hello Hadoop

Hello Kafka

Hello HDFS
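With the 30-second batch interval, once Flume picks the file up the driver console should print the per-batch word counts, roughly like this (timestamp, ordering, and which batch the data lands in will vary):

-------------------------------------------
Time: 1461756630000 ms
-------------------------------------------
(Hello,4)
(Hadoop,1)
(HDFS,1)
(Kafka,1)
(Spark,1)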

II. Source Code Analysis

1. Creating the stream with FlumeUtils.createPollingStream



FlumePollingInputDStream extends ReceiverInputDStream and overrides the getReceiver method, which constructs a FlumePollingReceiver.
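A trimmed paraphrase of the relevant Spark 1.6 class (not verbatim source):

private[streaming] class FlumePollingInputDStream[T: ClassTag](
    _ssc: StreamingContext,
    val addresses: Seq[InetSocketAddress],
    val maxBatchSize: Int,
    val parallelism: Int,
    storageLevel: StorageLevel
  ) extends ReceiverInputDStream[SparkFlumeEvent](_ssc) {

  // Spark runs this receiver on an executor; it polls the Flume SparkSink(s)
  override def getReceiver(): Receiver[SparkFlumeEvent] = {
    new FlumePollingReceiver(addresses, maxBatchSize, parallelism, storageLevel)
  }
}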


The receiver uses lazy vals and factory methods to create its worker threads and the NIO client socket factory.
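Paraphrased from FlumePollingReceiver (trimmed, not verbatim; lazy means nothing is created until first use):

// Daemon thread pool backing the Netty channel factory
lazy val channelFactoryExecutor = Executors.newCachedThreadPool(
  new ThreadFactoryBuilder().setDaemon(true)
    .setNameFormat("Flume Receiver Channel Thread - %d").build())

// Factory for NIO client sockets that connect to the Flume SparkSink
lazy val channelFactory =
  new NioClientSocketChannelFactory(channelFactoryExecutor, channelFactoryExecutor)

// One worker thread per degree of parallelism
lazy val receiverExecutor = Executors.newFixedThreadPool(parallelism,
  new ThreadFactoryBuilder().setDaemon(true)
    .setNameFormat("Flume Receiver Thread - %d").build())

// The message queue that worker threads poll connections from
private lazy val connections = new LinkedBlockingQueue[FlumeConnection]()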


Worker threads pull data from Flume in polling mode; in essence, they fetch their work from a message queue.


In the run() method, receiver.getConnections.poll() dequeues the next FlumeConnection for the worker to use.


Stepping into poll() shows it dequeuing from the message queue; tracing the matching enqueue calls reveals the neat part: after each batch the worker puts the connection back onto the queue, as the sketch below shows.
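A trimmed paraphrase of the worker loop (FlumeBatchFetcher in Spark 1.6; simplified, not verbatim):

def run(): Unit = {
  while (!Thread.interrupted()) {
    // dequeue: take a free connection off the message queue (blocks if none)
    val connection = receiver.getConnections.poll()
    val client = connection.client
    var batchReceived = false
    var seq: CharSequence = null
    try {
      // Avro RPC to the SparkSink: ask for up to maxBatchSize events
      val batch = client.getEventBatch(receiver.getMaxBatchSize)
      if (!SparkSinkUtils.isErrorBatch(batch)) { // a valid batch came back
        batchReceived = true
        seq = batch.getSequenceNumber
        receiver.store(toSparkFlumeEvents(batch.getEvents)) // hand events to Spark
        sendAck(client, seq) // let the sink commit its channel transaction
      }
    } catch {
      case exception: Exception =>
        logWarning("Error while receiving data from Flume", exception)
        sendNack(batchReceived, client, seq) // the sink rolls the batch back
    } finally {
      // enqueue: put the connection back so another worker can reuse it
      receiver.getConnections.add(connection)
    }
  }
}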


Following the types here leads back to FlumeConnection, declared alongside FlumePollingInputDStream: the entity representing a distributed connection to a Flume agent, pairing the Netty transceiver with the SparkFlumeProtocol.Callback client used for the Avro RPC calls.
In the fetcher's exception handling we see:

      sendNack(batchReceived, client, seq)
    }
  case exception: Exception =>
    logWarning("Error while receiving data from Flume", exception)
    sendNack(batchReceived, client, seq)

My reading of sendNack and sendAck is that they play a role similar to TCP's three-way handshake: the two ends of the transport protocol confirm each data transfer with one another. Looking back at the poll interface, it is really just taking the next message to be processed off the message queue.
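Paraphrased ack/nack helpers from the same fetcher (trimmed, not verbatim):

private def sendAck(client: SparkFlumeProtocol.Callback, seq: CharSequence): Unit = {
  // positive acknowledgement: the sink commits the Flume transaction for this batch
  client.ack(seq)
}

private def sendNack(batchReceived: Boolean, client: SparkFlumeProtocol.Callback,
    seq: CharSequence): Unit = {
  // only meaningful if a batch was actually received; the sink rolls back,
  // so the same batch will be delivered again later
  if (batchReceived) {
    client.nack(seq)
  }
}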

poll: the receiver listens on the data connection port; received data is placed on the receiving thread's main message queue and then dispatched to the worker threads' queues, where the data is extracted.


Sina Weibo: http://www.weibo.com/ilovepains