I. Creating a Spark Streaming Application
1. StreamingContext
val ssc = new StreamingContext(sparkConf, Seconds(10))
- The main entry point for streaming functionality.
- Creates a SparkContext under the hood, which does the actual data processing.
- The constructor takes a batch interval that specifies how often to process new data, in seconds; see the setup sketch below.
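A minimal setup sketch, assuming a local run (the app name and the local[2] master are illustrative; with a receiver-based source, local mode needs at least two threads, one to receive and one to process):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf()
  .setAppName("StreamingDemo")   // hypothetical app name
  .setMaster("local[2]")         // >= 2 threads: one receiver, one processor
val ssc = new StreamingContext(sparkConf, Seconds(10))  // 10-second batches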
2. socketTextStream
val lines = ssc.socketTextStream("localhost", 9999)
Creates a DStream of the text data received on port 9999 of the local machine.
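The same method also takes an explicit storage level for the received blocks; this is the overload used by the examples further down:

import org.apache.spark.storage.StorageLevel

// Store received blocks serialized in memory, spilling to disk if needed
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)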
3. start()
- Setting up the computation only defines it; the computation runs as the system receives data.
- To actually start receiving data, you must explicitly call the StreamingContext's start() method. Spark Streaming then continuously submits Spark jobs to the underlying SparkContext for scheduling and execution.
4. awaitTermination()
Execution proceeds in a separate thread, so call awaitTermination() to wait for the streaming computation to finish and keep the application from exiting. A bounded-wait variant is sketched below.
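A bounded-wait sketch, assuming a 60-second timeout is acceptable: awaitTerminationOrTimeout returns false if the timeout elapses before the computation stops, after which the context can be stopped explicitly.

// Wait at most 60s (illustrative value); if still running, stop the streaming
// context and the underlying SparkContext, letting in-flight batches finish
if (!ssc.awaitTerminationOrTimeout(60 * 1000L)) {
  ssc.stop(stopSparkContext = true, stopGracefully = true)
}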
II. Running
1. Run the job
1> redirects standard output, 2> redirects standard error:
bash wc_local.sh 1>1.log 2>2.log
2. Monitor the log
tail -f 1.log
3. Open the port
nc -l 9999
Once the job connects, type lines of words into the nc session; each line becomes one record in the stream.
4. Test WordCount
package com.albert.streaming.test

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: WordCount <hostname> <port>")
      System.exit(1)
    }
    val sparkConf = new SparkConf().setAppName("StreamingWordCount")
    // Batch interval of 5 seconds: one micro-batch every 5s
    val streamCtx = new StreamingContext(sparkConf, Seconds(5))
    // Receive text lines from the given host/port; keep received blocks
    // serialized in memory, spilling to disk when memory is tight
    val lines = streamCtx.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    // Count words within each batch (no state carried across batches)
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    // Write each batch's result to HDFS as stream_out-<timestamp>.doc
    wordCounts.saveAsTextFiles("hdfs://master:9000/stream_out", "doc")
    streamCtx.start()              // start receiving and processing
    streamCtx.awaitTermination()   // block until the computation stops
  }
}
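With input such as hello world hello spark typed into the nc session, each batch's print() output looks roughly like this (timestamp illustrative):

-------------------------------------------
Time: 1590989540000 ms
-------------------------------------------
(hello,2)
(world,1)
(spark,1)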
4.1 Local
wc_local.sh
/usr/local/src/spark-1.6.0-bin-hadoop2.6/bin/spark-submit \
--master local[2] \
--class com.albert.streaming.test.WordCount /usr/local/src/learn/albert/25_spark_streaming/streaming-1.0-SNAPSHOT.jar \
master \
9999
4.2 Standalone
wc_standalone.sh
/usr/local/src/spark-1.6.0-bin-hadoop2.6/bin/spark-submit \
--master spark://master:7077 \
--num-executors 2 \
--executor-memory 1g \
--executor-cores 2 \
--driver-memory 1g \
--class com.albert.streaming.test.WordCount /usr/local/src/learn/albert/25_spark_streaming/streaming-1.0-SNAPSHOT.jar \
master \
9999
4.3 Cluster (YARN)
wc_cluster.sh
/usr/local/src/spark-1.6.0-bin-hadoop2.6/bin/spark-submit \
--master yarn-cluster \
--num-executors 2 \
--executor-memory 1g \
--executor-cores 2 \
--driver-memory 1g \
--class com.albert.streaming.test.WordCount /usr/local/src/learn/albert/25_spark_streaming/streaming-1.0-SNAPSHOT.jar \
master \
9999
Kill the job:
yarn application -kill application_1590989536331_0001
5. Test WordCountWithState
Preserves historical state: word counts accumulate across batches instead of resetting with each batch.
package com.albert.streaming.test

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WordCountWithState {
  // Merge this batch's counts for a key with the previously accumulated state
  def updateFunction(currentValues: Seq[Int], preValues: Option[Int]): Option[Int] = {
    val current = currentValues.sum    // sum of the new values in this batch
    val pre = preValues.getOrElse(0)   // previously accumulated count, 0 if none
    Some(current + pre)
  }

  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: WordCountWithState <hostname> <port>")
      System.exit(1)
    }
    val sparkConf = new SparkConf().setAppName("StreamingWordCountWithState")
    val streamCtx = new StreamingContext(sparkConf, Seconds(5))
    // updateStateByKey requires checkpointing to persist state between batches
    streamCtx.checkpoint("hdfs://master:9000/hdfs_checkpoint")
    val lines = streamCtx.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    // Running total per word across all batches seen so far
    val wordCounts = words.map(x => (x, 1)).updateStateByKey(updateFunction _)
    wordCounts.print()
    wordCounts.saveAsTextFiles("hdfs://master:9000/stream_state_out", "doc")
    streamCtx.start()
    streamCtx.awaitTermination()
  }
}
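A standalone sanity check of the state semantics (not part of the streaming job): a batch contributing Seq(1, 1) for a key whose accumulated count is 3 yields a new state of 5.

// updateFunction is an ordinary function, so it can be tested directly
assert(WordCountWithState.updateFunction(Seq(1, 1), Some(3)) == Some(5))
assert(WordCountWithState.updateFunction(Seq(1), None) == Some(1))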
6. Test WindowTest
Keeps only the data from a window of time ending at the current moment. For example, with a 5-second window, at 9:00:00 the data covers 8:59:55-9:00:00, and at 9:00:30 it covers 9:00:25-9:00:30. (The code below uses a 30-second window that slides every 10 seconds.)
package com.albert.streaming.test

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowTest {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("StreamingWindowTest")
    val streamCtx = new StreamingContext(sparkConf, Seconds(10))   // 10s batches
    streamCtx.checkpoint("hdfs://master:9000/hdfs_checkpoint")
    val lines = streamCtx.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    // Count over a 30s window (the last 3 batches), recomputed every 10s
    val wordCounts = words.map(x => (x, 1))
      .reduceByKeyAndWindow((v1: Int, v2: Int) => v1 + v2, Seconds(30), Seconds(10))
    wordCounts.print()
    streamCtx.start()
    streamCtx.awaitTermination()
  }
}
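A sketch of the incremental variant of the same API: passing an inverse reduce function lets Spark update the window by adding the counts of the batch entering it and subtracting those of the batch leaving it, instead of recomputing the full 30 seconds on every slide. It requires checkpointing, which is already enabled above; it would replace the wordCounts line in the job.

val wordCounts = words.map(x => (x, 1)).reduceByKeyAndWindow(
  (v1: Int, v2: Int) => v1 + v2,   // counts from the batch entering the window
  (v1: Int, v2: Int) => v1 - v2,   // counts from the batch leaving the window
  Seconds(30), Seconds(10))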
Tested locally only; the other deployment modes work the same as in section 4.
Appendix: pom.xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_2.10</artifactId>
    <version>1.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.10</artifactId>
    <version>1.6.0</version>
</dependency>