Spark Streaming is the Spark component for stream processing of real-time data, and it provides a rich API for working with data streams. The basic unit of data that Spark Streaming processes is the DStream. Because the workload is streaming (data keeps arriving at the Spark program continuously), Spark Streaming is designed to be combined with Spark Core and Spark SQL: if the incoming data is unstructured, process it with Spark Core; if it is structured, process it with Spark SQL.
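To make the Spark SQL combination concrete, here is a minimal sketch (the WordRecord bean, the countWithSql helper, and the words view name are illustrative assumptions, not part of the examples below) of how each micro-batch of a DStream can be converted to a DataFrame and queried with SQL, following the foreachRDD pattern from the Spark programming guide:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.streaming.api.java.JavaDStream;

public class DStreamSqlSketch {
    // Simple JavaBean so that createDataFrame can infer the schema (illustrative)
    public static class WordRecord implements java.io.Serializable {
        private String word;
        public String getWord() { return word; }
        public void setWord(String word) { this.word = word; }
    }

    // Run a SQL aggregation over every micro-batch of a DStream of words
    public static void countWithSql(JavaDStream<String> words) {
        words.foreachRDD(rdd -> {
            // Reuse (or lazily create) a singleton SparkSession
            SparkSession spark = SparkSession.builder()
                    .config(rdd.context().getConf())
                    .getOrCreate();
            JavaRDD<WordRecord> rows = rdd.map(w -> {
                WordRecord r = new WordRecord();
                r.setWord(w);
                return r;
            });
            Dataset<Row> df = spark.createDataFrame(rows, WordRecord.class);
            df.createOrReplaceTempView("words");
            spark.sql("SELECT word, COUNT(*) AS total FROM words GROUP BY word").show();
        });
    }
}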
Simple examples
1. Count the words in the text data received from a data server listening on a TCP socket. All you need to do is the following.
Install netcat
Start a listener on port 8888:
./nc -lp 8888
Any lines typed into this terminal will be sent to the Spark application once it connects.
Add the Maven dependency:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.10</artifactId>
    <version>2.2.0</version>
</dependency>
The Java code is as follows:
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

public class NetworkWordCount {
    public static void main(String[] args) throws InterruptedException {
        // Create a local StreamingContext with two working threads and a batch interval of 1 second
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        // Create a DStream that will connect to hostname:port, like localhost:8888
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 8888);

        // Split each line into words
        JavaDStream<String> words = lines.flatMap(x -> Arrays.asList(x.split(" ")).iterator());

        // Count each word in each batch
        JavaPairDStream<String, Integer> pairs = words.mapToPair(s -> new Tuple2<>(s, 1));
        JavaPairDStream<String, Integer> wordCounts = pairs.reduceByKey((i1, i2) -> i1 + i2);

        // Print the first ten elements of each RDD generated in this DStream to the console
        wordCounts.print();

        jssc.start();            // Start the computation
        jssc.awaitTermination(); // Wait for the computation to terminate
    }
}
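With the nc listener running, start the program and type a few lines such as hello world hello into the nc terminal. Each one-second batch prints its counts to the console, roughly like this (the timestamp will differ):

-------------------------------------------
Time: 1496010001000 ms
-------------------------------------------
(hello,2)
(world,1)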
2. Listen for messages on a Kafka topic.
First, create the Kafka topic and publish a few messages to it.
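For example, with a single-node Kafka 0.10.x installation (the paths and the default ZooKeeper/broker addresses below are assumptions about a stock local setup), the topic used in the code can be created and fed from the Kafka distribution's bin directory:

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic dome-topic
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic dome-topic

Lines typed into the console producer become messages on the topic.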
Add the Maven dependency:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.10</artifactId>
    <version>2.2.0</version>
</dependency>
The Java code is as follows:
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class SqlSparkStream {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local").setAppName("SqlSparkStream");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        // Kafka consumer configuration
        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "127.0.0.1:9092");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "use_a_separate_group_id_for_each_stream");
        kafkaParams.put("auto.offset.reset", "latest");
        kafkaParams.put("enable.auto.commit", false);

        Collection<String> topics = Arrays.asList("dome-topic");

        // Create a direct stream that subscribes to the topic
        JavaInputDStream<ConsumerRecord<String, String>> stream =
                KafkaUtils.createDirectStream(
                        jssc,
                        LocationStrategies.PreferConsistent(),
                        ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams)
                );

        // Extract the message value from each record and print it
        JavaDStream<String> values = stream.map(record -> record.value());
        values.print();

        jssc.start();            // Start the computation
        jssc.awaitTermination(); // Wait for the computation to terminate
    }
}
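Note that enable.auto.commit is set to false, so offsets are never committed back to Kafka automatically. If you want Kafka to track how far the stream has read, the kafka010 integration lets you commit offsets yourself after processing each batch; here is a minimal sketch reusing the stream variable above, based on the CanCommitOffsets API from the official integration guide:

import org.apache.spark.streaming.kafka010.CanCommitOffsets;
import org.apache.spark.streaming.kafka010.HasOffsetRanges;
import org.apache.spark.streaming.kafka010.OffsetRange;

// Commit each batch's offsets back to Kafka after the batch has been processed
stream.foreachRDD(rdd -> {
    OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
    // ... process rdd here ...
    ((CanCommitOffsets) stream.inputDStream()).commitAsync(offsetRanges);
});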