1. File Stream (DStream)
First create the data files:
cd /usr/local/spark/mycode
mkdir streaming
cd streaming
mkdir logfile
cd logfile
touch log1.txt
touch log2.txt
Open a Linux terminal window and enter the shell command prompt:
cd /usr/local/spark/mycode/streaming
vim TestStreaming.py
Enter the following code in TestStreaming.py:
from operator import add
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext

conf = SparkConf()
conf.setAppName('TestDStream')
conf.setMaster('local[2]')   # run locally with two worker threads
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 20)   # batch interval of 20 seconds
# Monitor the logfile directory; each new file that appears there is read as a batch of lines
lines = ssc.textFileStream('file:///usr/local/spark/mycode/streaming/logfile')
words = lines.flatMap(lambda line: line.split(' '))
wordCounts = words.map(lambda x: (x, 1)).reduceByKey(add)
wordCounts.pprint()   # print the word counts computed in each batch
ssc.start()
ssc.awaitTermination()
After saving the file, run the following command:
python3 TestStreaming.py
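Note that textFileStream only picks up files created in the monitored directory after the stream has started; modifying log1.txt or log2.txt in place will not trigger any processing. Below is a minimal helper sketch (a hypothetical gen_logfile.py, not part of the original steps) that drops fresh files into the same logfile directory while TestStreaming.py is running:
# gen_logfile.py -- hypothetical helper for generating test input
import time

LOG_DIR = '/usr/local/spark/mycode/streaming/logfile'   # directory monitored by TestStreaming.py

for i in range(5):
    # textFileStream only detects newly created files, so write a brand-new file each time
    with open('%s/auto_log%d.txt' % (LOG_DIR, i), 'w') as f:
        f.write('hello spark hello streaming\n')
    time.sleep(20)   # pause for one 20-second batch interval between files
Run it from a second terminal with python3 gen_logfile.py and watch word counts appear in the first terminal every 20 seconds.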
2. Socket Stream (DStream)
A socket is an abstraction layer through which an application can send or receive data; it can be opened, read from, written to, and closed much like a file. Sockets allow an application to plug its I/O into the network and communicate with other applications on the network. A network socket is the combination of an IP address and a port.
Spark Streaming can listen on a socket port, receive data from it, and process the data accordingly.
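To make the socket abstraction concrete, here is a minimal client sketch (illustrative only; localhost and port 9999 match the server written below) that opens a connection, reads once, and closes, mirroring the open/read/close analogy with files:
import socket

client = socket.socket()                 # create a TCP socket
client.connect(('localhost', 9999))      # 'open' a connection to the server
data = client.recv(1024)                 # 'read' up to 1024 bytes from it
print(data.decode())
client.close()                           # 'close' the connection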
First, write the socket server:
cd /usr/local/spark/mycode/streaming
mkdir socket  # skip if this directory already exists
cd socket
vim DataSourceSocket.py
import socket

# Create a TCP server socket listening on localhost:9999
server = socket.socket()
server.bind(('localhost', 9999))
server.listen(1)
while True:
    print('I am waiting for a connection...')
    conn, addr = server.accept()
    print('Connect success! Connection is from %s' % addr[0])
    print('sending data')
    conn.send('what waht a a a i'.encode())   # send a short test string to the client
    conn.close()
    print('connection is broken.')
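To use this custom server instead of nc as the data source, start it in the current terminal:
python3 DataSourceSocket.py
and then run the word-count client described below against localhost 9999.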
Open a new shell window, enter the shell command prompt, and run the following commands:
cd /usr/local/spark/mycode
mkdir streaming  # skip if this directory already exists
cd streaming
vim NetworkWordCount.py
Enter the following code in NetworkWordCount.py:
from __future__ import print_function
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: network_wordcount.py <hostname> <port>", file=sys.stderr)
        exit(-1)
    sc = SparkContext(appName="PythonStreamingNetworkWordCount")
    ssc = StreamingContext(sc, 1)   # batch interval of 1 second
    lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))
    counts = lines.flatMap(lambda line: line.split(" "))\
                  .map(lambda word: (word, 1))\
                  .reduceByKey(lambda a, b: a+b)
    counts.pprint()
    ssc.start()
    ssc.awaitTermination()
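Here socketTextStream(hostname, port) makes Spark Streaming connect to the given host and port as a TCP client; every newline-terminated line it receives becomes one record in the DStream, which the flatMap/map/reduceByKey chain turns into per-batch word counts.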
After saving the file, start the data source first. In one terminal, run netcat (nc) as a simple socket server:
sudo nc -lk 9999
Then open a second terminal as the listening window and start the client:
# start the client
cd /usr/local/spark/mycode/streaming
/usr/local/spark/bin/spark-submit NetworkWordCount.py localhost 9999
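In the nc command, -l puts netcat into listen (server) mode and -k keeps it listening after a client disconnects, so the streaming program can be restarted without restarting nc.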
Now you can type arbitrary words into the first (nc) terminal window; the listening window automatically receives the word data stream and prints word-frequency statistics every second, producing output on screen similar to the following:
-------------------------------------------
Time: 1479431100000 ms
-------------------------------------------
(hello,1)
(world,1)
-------------------------------------------
Time: 1479431120000 ms
-------------------------------------------
(hadoop,1)
-------------------------------------------
Time: 1479431140000 ms
-------------------------------------------
(spark,1)
To stop the program, press Ctrl+C (note that Ctrl+Z only suspends the process rather than terminating it).
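Alternatively, a streaming application can shut itself down programmatically with ssc.stop(), as the RDD queue example in the next part does.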
3. RDD Queue Stream
When debugging a Spark Streaming application, you can create a DStream backed by a queue of RDDs with streamingContext.queueStream(queueOfRDDs).
The following program, based on the QueueStream example on the official Spark website, builds a queue of five RDDs up front; the streaming job then processes the queued data batch by batch at its 1-second batch interval.
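By default, queueStream consumes one RDD from the queue per batch interval; its oneAtATime parameter can be set to False to make each batch process all RDDs currently in the queue instead.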
Log in to the Linux system, open a terminal, enter the shell command prompt, and run the following commands to create a new code file:
cd /usr/local/spark/mycode/streaming/
vim TestRDDQueueStream.py
The vim editor above created a new file named TestRDDQueueStream.py; enter the following code in it:
import time
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

if __name__ == "__main__":
    sc = SparkContext(appName="PythonStreamingQueueStream")
    ssc = StreamingContext(sc, 1)
    # Create the queue through which RDDs can be pushed to
    # a QueueInputDStream
    rddQueue = []
    for i in range(5):
        rddQueue += [ssc.sparkContext.parallelize([j for j in range(1, 1001)], 10)]
    # Create the QueueInputDStream and use it to do some processing
    inputStream = ssc.queueStream(rddQueue)
    mappedStream = inputStream.map(lambda x: (x % 10, 1))
    reducedStream = mappedStream.reduceByKey(lambda a, b: a + b)
    reducedStream.pprint()
    ssc.start()
    time.sleep(6)
    ssc.stop(stopSparkContext=True, stopGraceFully=True)
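Here time.sleep(6) keeps the driver alive long enough for several 1-second batches to run, and stopGraceFully=True lets any in-flight batch finish before the StreamingContext (and, because stopSparkContext=True, the underlying SparkContext) shuts down.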
Run the code:
python3 ./TestRDDQueueStream.py
Output (each batch consumes one RDD containing the integers 1 to 1000, so every residue class mod 10 appears exactly 100 times):
-------------------------------------------
Time: 1479522100000 ms
-------------------------------------------
(4,100)
(0,100)
(6,100)
(8,100)
(2,100)
(1,100)
(3,100)
(7,100)
(9,100)
(5,100)