Monitoring a file stream
- Create the input file
Put some lines of text into log1.txt.
- In Terminal A, start spark-shell and enter the commands that set up the file-stream listener (a sketch of these commands appears after this list)
At this point the listener is running.
- In a second terminal, create a new file log2.txt containing:
I love Hadoop
I love Spark
Spark is slow
Terminal A then displays the word-count result:
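The spark-shell commands referred to above are not reproduced in these notes; the following is a minimal sketch, assuming the log files live under /home/hadoop/streaming/logfile (the directory used by the later spark-submit example) and an arbitrary 20-second batch interval:

import org.apache.spark.streaming._

// sc is the SparkContext that spark-shell already provides
val ssc = new StreamingContext(sc, Seconds(20))
// Watch the directory; textFileStream only picks up files created after the stream starts
val lines = ssc.textFileStream("file:///home/hadoop/streaming/logfile")
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()   // blocks the shell; counts are printed once per batch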
Spark listening on a socket stream
- Install the nc tool and listen on port 9999
nc -lk 9999
If TCP port 9999 is not open, open it with the system firewall (e.g. firewall-cmd).
- Write the listening program
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.storage.StorageLevel

object NetworkWordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: NetworkWordCount <hostname> <port>")
      System.exit(1)
    }
    StreamingExamples.setStreamingLogLevels()
    // Create the context with a 1 second batch size
    val sparkConf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(1))
    // Create a socket stream on target ip:port and count the
    // words in input stream of \n delimited text (eg. generated by 'nc')
    // Note that no duplication in storage level only for running locally.
    // Replication necessary in distributed scenario for fault tolerance.
    val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
The StreamingExamples class
import org.apache.spark.internal.Logging
import org.apache.log4j.{Level, Logger}

object StreamingExamples extends Logging {
  /** Set reasonable logging levels for streaming if the user has not configured log4j. */
  def setStreamingLogLevels() {
    val log4jInitialized = Logger.getRootLogger.getAllAppenders.hasMoreElements
    if (!log4jInitialized) {
      // We first log something to initialize Spark's default logging, then we override the
      // logging level.
      logInfo("Setting log level to [WARN] for streaming example." +
        " To override add a custom log4j.properties to the classpath.")
      Logger.getRootLogger.setLevel(Level.WARN)
    }
  }
}
Run the task
./spark-submit --class "NetworkWordCount" /home/hadoop/hbaseoperation_2.11-0.1.jar localhost 9999
Problem
Error connecting to localhost:9999 java.net.ConnectException: Connection refused
Solution: run nc -lk 9999 first, then submit the Spark job.
Socket server: read a file and write its lines to a port
import java.io.PrintWriter
import java.net.ServerSocket
import scala.io.Source

object DataSourceSocket {
  def index(length: Int) = {
    val rdm = new java.util.Random
    rdm.nextInt(length)
  }

  def main(args: Array[String]) {
    if (args.length != 3) {
      System.err.println("Usage: <filename> <port> <millisecond>")
      System.exit(1)
    }
    val fileName = args(0)
    val lines = Source.fromFile(fileName).getLines.toList
    val rowCount = lines.length
    val listener = new ServerSocket(args(1).toInt)
    while (true) {
      val socket = listener.accept()
      new Thread() {
        override def run = {
          println("Got client connected from: " + socket.getInetAddress)
          val out = new PrintWriter(socket.getOutputStream(), true)
          while (true) {
            Thread.sleep(args(2).toLong)
            val content = lines(index(rowCount))
            println(content)
            out.write(content + '\n')
            out.flush()
          }
          socket.close()
        }
      }.start()
    }
  }
}
Run
./spark-submit --class "DataSourceSocket" /home/hadoop/hbaseoperation_2.11-0.1.jar /home/hadoop/streaming/logfile/log1.txt 9999 1000
Start the Spark listening program
./spark-submit --class "NetworkWordCount" /home/hadoop/hbaseoperation_2.11-0.1.jar localhost 9999
Results:
Data sender side
Spark processing side
Spark reading from an RDD queue
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.{Seconds, StreamingContext}

object QueueStream {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("TestRDDQueue").setMaster("local[2]")
    // Batch interval of 20 seconds: a new batch is processed every 20 seconds
    val ssc = new StreamingContext(sparkConf, Seconds(20))
    val rddQueue = new scala.collection.mutable.SynchronizedQueue[RDD[Int]]()
    val queueStream = ssc.queueStream(rddQueue)
    val mappedStream = queueStream.map(r => (r % 10, 1))
    val reducedStream = mappedStream.reduceByKey(_ + _)
    reducedStream.print()
    ssc.start()
    for (i <- 1 to 10) {
      rddQueue += ssc.sparkContext.makeRDD(1 to 100, 2)
      Thread.sleep(1000)
    }
    ssc.stop()
  }
}
Run and results
./spark-submit --class "QueueStream" /home/hadoop/hbaseoperation_2.11-0.1.jar
Apache Kafka as a DStream data source
- Install and test Kafka
- Open Terminal A and start ZooKeeper
cd /usr/local/kafka
./bin/zookeeper-server-start.sh config/zookeeper.properties
- Open Terminal B and start the Kafka broker
cd /usr/local/kafka
bin/kafka-server-start.sh config/server.properties
- Open Terminal C and create the topic
cd /usr/local/kafka
./bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic wordsendertest
./bin/kafka-topics.sh --list --zookeeper localhost:2181
- Open Terminal D and start the producer
./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic wordsendertest
After the command starts, type hello spark
- Open Terminal E and start the consumer
cd /usr/local/kafka
./bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic wordsendertest --from-beginning
You can see that the message has been received.
- Download spark-streaming-kafka_2.11.jar
You can let IDEA's sbt download the jar and then copy it into the spark-2.4.3/jars directory; in addition, copy all jar files from the libs directory of the Kafka installation into the /spark-2.4.3/jars/kafka directory. (A sketch of the sbt dependency is shown below.)
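As a sketch of the sbt dependency (the exact coordinates and version are an assumption; spark-streaming-kafka_2.11 was only published for the Spark 1.x line, which is the root of the version problem described at the end of this section):

// build.sbt (hypothetical version number)
libraryDependencies += "org.apache.spark" % "spark-streaming-kafka_2.11" % "1.6.3"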
- Write the Kafka producer
import java.util.HashMap
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._

object KafkaWordProducer {
  def main(args: Array[String]) {
    if (args.length < 4) {
      System.err.println("Usage: KafkaWordCountProducer <metadataBrokerList> <topic> " +
        "<messagesPerSec> <wordsPerMessage>")
      System.exit(1)
    }
    val Array(brokers, topic, messagesPerSec, wordsPerMessage) = args
    // Kafka producer connection properties (broker list and serializers)
    val props = new HashMap[String, Object]()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    // Send some messages
    while (true) {
      (1 to messagesPerSec.toInt).foreach { messageNum =>
        val str = (1 to wordsPerMessage.toInt).map(x => scala.util.Random.nextInt(10).toString)
          .mkString(" ")
        print(str)
        println()
        val message = new ProducerRecord[String, String](topic, null, str)
        producer.send(message)
      }
      Thread.sleep(1000)
    }
  }
}
- Write the Kafka consumer
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaWordCount {
  def main(args: Array[String]) {
    val sc = new SparkConf().setAppName("KafkaWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(sc, Seconds(10))
    // Set a checkpoint directory. To keep it on HDFS instead, use something like
    // ssc.checkpoint("/user/hadoop/checkpoint"), but Hadoop must be running in that case.
    ssc.checkpoint("file:///home/hadoop/spark-2.4.3/mycode/kafka/checkpoint")
    val zkQuorum = "localhost:2181"   // ZooKeeper server address (default client port)
    val group = "consumer-grp"        // consumer group for the topic; any name works, e.g. "test-consumer-group"
    val topics = "wordsender"         // topic name(s), comma separated
    val numThreads = 1                // number of consumer threads per topic
    val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
    val lineMap = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap)
    val lines = lineMap.map(_._2)
    val words = lines.flatMap(_.split(" "))
    val pair = words.map(x => (x, 1))
    // Windowed count; reduceByKeyAndWindow is explained in the next section on window operations
    val wordCounts = pair.reduceByKeyAndWindow(_ + _, _ - _, Minutes(2), Seconds(10), 2)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
The StreamingExamples class is the same as the one listed earlier.
- Run the producer
./spark-submit --driver-class-path /usr/local/spark/jars/*:/home/hadoop/spark-2.4.3/jars/kafka/* --class "KafkaWordProducer" /home/hadoop/hbaseoperation_2.11-0.1.jar localhost:9092 wordsender 3 5
- Run the consumer
./spark-submit --driver-class-path /home/hadoop/spark-2.4.3/jars/*:/home/hadoop/spark-2.4.3/jars/kafka/* --class "KafkaWordCount" /home/hadoop/hbaseoperation_2.11-0.1.jar
This fails with: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/Logging
Reason: org.apache.spark.Logging exists only in Spark 1.5.2 and earlier; it was removed in Spark 2.0.0, so the versions need to match.
The workaround is to download a jar that contains org.apache.spark.Logging and put it in Spark's jars directory.
Reference: https://stackoverflow.com/questions/40287289/java-lang-noclassdeffounderror-org-apache-spark-logging
Running again still fails: Exception in thread "dispatcher-event-loop-1" java.lang.NoClassDefFoundError: Lkafka/consumer/ConsumerConnector
The likely cause is that the Kafka connector and Spark versions are incompatible: the spark-streaming-kafka_2.11 jar targets Spark 1.x. Either fall back to a Spark 1.6.x release, or keep Spark 2.4.3 and switch to a matching connector, i.e. spark-streaming-kafka-0-8_2.11 (which still provides KafkaUtils.createStream) or spark-streaming-kafka-0-10_2.11. A sketch of the 0-10 approach follows.
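A minimal sketch of the consumer rewritten against the kafka-0-10 direct-stream API, assuming spark-streaming-kafka-0-10_2.11 matching the Spark version (plus the Kafka client jars) is on the classpath; the broker address, topic, and group name reuse the values from the examples above:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object KafkaWordCount010 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaWordCount010").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    // Consumer configuration handed straight to the Kafka 0.10+ client
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "consumer-grp",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    // Direct stream: no receiver, no ZooKeeper quorum needed
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Array("wordsender"), kafkaParams))
    // Simple per-batch word count over the message values
    val wordCounts = stream.map(_.value).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}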