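The wordcount example that ships with Spark Streaming is a bit too simple, and running it once did not teach me much. So I made it slightly more complex, which gave me a better understanding of streaming.
What this wordcount does:
A line of numbers separated by "," is sent to the Spark app over a socket. The app converts each item to an int and counts how often it occurs. If the input is not numeric, the Spark app reports an error and stops.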
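The per-line logic described above can be sketched in plain Python, without Spark, to make the expected behavior concrete. The function name count_numbers is hypothetical and only used for illustration; it relies on int() raising ValueError for non-numeric input, which is the kind of failure the streaming app is meant to stop on.

```python
from collections import Counter


def count_numbers(line):
    """Split a comma-separated line, convert each item to int, and count them."""
    numbers = [int(token) for token in line.strip().split(",")]
    return Counter(numbers)


# A numeric line is counted item by item.
print(count_numbers("3,7,3,20\n"))

# A line with letters makes int() raise ValueError.
try:
    count_numbers("a,b,c,d,e")
except ValueError as err:
    print("input wrong:", err)
```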
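Socket file: socket.py
After a socket has been closed, opening it again can fail with an "Address already in use" error. Calling setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1) allows the same address to be reused. The script then generates a line of 5 to 15 random numbers and sends it over the socket.
A socket can only send bytes, so the string has to be encoded first. A bytes literal such as b'aaaaa' works too.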
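As a small check of the bytes requirement, the sketch below uses socket.socketpair() from the standard library to get two connected sockets locally. The helper send_text is a hypothetical name, not part of the original script.

```python
import socket


def send_text(conn, text):
    """Encode a str as UTF-8 bytes and send all of it over the socket."""
    conn.sendall(text.encode("utf-8"))


# socketpair() returns two sockets that are already connected to each other.
left, right = socket.socketpair()
try:
    try:
        left.send("1,2,3\n")  # passing a str instead of bytes
        str_rejected = False
    except TypeError:
        str_rejected = True

    send_text(left, "1,2,3\n")  # encoded string
    left.send(b"4,5\n")         # bytes literal

    # Read until both newline-terminated lines have arrived.
    received = b""
    while received.count(b"\n") < 2:
        received += right.recv(1024)
finally:
    left.close()
    right.close()

print(str_rejected, received)
```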
https://github.com/a8131357/wordcount.git
import random
import socket
from time import sleep

host = 'localhost'
port = 9999

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Allow the address to be reused right after a previous run has closed it
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.bind((host, port))
s.listen(1)
print('\nListening for a client at', host, port)
conn, addr = s.accept()
print('\nConnected by', addr)

count = 0
try:
    while True:
        # Line sent to wordcount: 5 to 15 random numbers between 1 and 20
        line = [str(random.randint(1, 20)) for x in range(random.randint(5, 15))]
        line = ",".join(line) + "\n"
        # A socket can only send bytes, so encode the string (b'aaaaa' would also work)
        conn.send(line.encode('utf-8'))
        sleep(2)
        count += 1
        if count == 20:
            # Send one line that contains letters, then stop sending
            conn.send("a,b,c,d,e".encode('utf-8'))
            conn.close()
            break
except socket.error:
    print('Error occurred.\n\nClient disconnected.\n')
s.shutdown(socket.SHUT_RDWR)
s.close()
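wordCount file
If the streaming job finds a problem with the input data, the streaming app stops and reports an error. This is implemented with:
ssc.awaitTerminationOrTimeout(30)
ssc.awaitTermination()
Both calls wait for the streaming app. awaitTermination() waits without a time limit, while awaitTerminationOrTimeout(30) has two exits: the timeout, set here so that the app ends automatically after 30 seconds,
or an exception thrown inside the streaming app, which is handled with try/except.
ssc.stop(): by default this stops both the SparkContext and the StreamingContext. You can choose to keep the SparkContext so that the app can be restarted quickly. With stopGraceFully=True (the PySpark spelling of the parameter), the streaming app waits until the tasks of all threads have completed before shutting down, instead of being stopped forcibly.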
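The two exits and the stop call can also be wrapped in one helper, sketched below under some assumptions. The name run_until_timeout_or_error is hypothetical, and ssc is assumed to be a pyspark StreamingContext that has not been started yet. The helper only uses start(), awaitTerminationOrTimeout() and stop(), with the keyword names used elsewhere in this post.

```python
def run_until_timeout_or_error(ssc, timeout_seconds):
    """Start ssc, wait for the timeout or an error, then always stop it.

    Returns True if the wait ended without an exception, False otherwise.
    """
    ssc.start()
    try:
        # Exit 1: returns once the timeout elapses or the app is stopped.
        ssc.awaitTerminationOrTimeout(timeout_seconds)
        ok = True
    except Exception as exc:
        # Exit 2: an error raised inside the streaming app surfaces here.
        print('input wrong:', exc)
        ok = False
    finally:
        # Keep the SparkContext alive so a new StreamingContext can reuse it.
        ssc.stop(stopSparkContext=False, stopGraceFully=True)
    return ok
```

Using finally avoids repeating the stop call in both branches. Note that the socket script in this post sends its non-numeric line only after 20 lines at 2 seconds each, so a timeout shorter than roughly 40 seconds would probably end the app before the bad input arrives.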
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Reuse the running SparkContext (e.g. in the pyspark shell) or create one
sc = SparkContext.getOrCreate()
ssc = StreamingContext(sc, 5)  # 5-second batches
lines = ssc.socketTextStream("localhost", 9999)
# int(x) raises ValueError for non-numeric input, which fails the batch
words = lines.flatMap(lambda line: line.split(",")).map(lambda x: int(x))
pairs = words.map(lambda word: (word, 1))
# Count each word in each batch
wordCounts = pairs.reduceByKey(lambda x, y: x + y)
# Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.pprint()
ssc.start()  # Start the computation
try:
    ssc.awaitTerminationOrTimeout(30)
    ssc.stop(stopSparkContext=False, stopGraceFully=True)
except Exception:
    print('input wrong')
    ssc.stop(stopSparkContext=False, stopGraceFully=True)
Result