An Improved Word Count Example for Spark Streaming

Latest recommended article published on 2020-12-18 12:45:52

大胖头leo

Category columns: PySpark学习日志 (PySpark Study Log), Python. Tags: pyspark, wordcount

Article link: https://blog.csdn.net/a8131357leo/article/details/100945452


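The word count example that ships with Spark Streaming is a bit too simple; running it once did not teach me much. So I made it slightly more complicated, and working through this version gave me a better understanding of streaming.

What the word count does:

A socket sends the Spark app lines of numeric strings separated by ","; the app converts each string to an int and counts how often each number occurs. If the input is not a number, the Spark app raises an error and stops.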
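To make the intended behaviour concrete before bringing in Spark, here is a minimal pure-Python sketch of the same logic for a single line. The function name `count_numbers` is made up for this illustration; the real app does the equivalent with `flatMap`, `map` and `reduceByKey` on a DStream.

```python
from collections import Counter


def count_numbers(line: str) -> Counter:
    """Split a comma-separated line, convert each field with int(), and count occurrences."""
    numbers = [int(field) for field in line.strip().split(",")]
    return Counter(numbers)


# A valid line: 3 appears twice, 7 and 12 once each
print(count_numbers("3,7,3,12\n"))

# A line with letters makes int() raise ValueError, which is the failure the app relies on
try:
    count_numbers("a,b,c,d,e")
except ValueError as err:
    print("input wrong:", err)
```

The whole design hinges on `int()` raising for non-numeric fields, so no extra validation code is needed to detect bad input.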

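The socket file: socket.py

Note that a script named socket.py shadows the standard-library socket module when it is run from its own directory, so `import socket` ends up importing the script itself; saving it under a different name (for example socket_server.py, a name chosen here only for illustration) avoids this.

After a socket is closed, binding to the same address again fails with an "Address already in use" error. Calling setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1) lets the same address be reused. The script then generates a line of 5 to 15 comma-separated random numbers and sends it over the socket.

A socket can only send bytes, so the string has to be encoded first; a bytes literal such as b'aaaaa' also works.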
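As a small illustration of the payload format and the encoding step, the sketch below builds one line and encodes it. The helper name `make_line` and the seeded `random.Random` instance are assumptions introduced for this example; they are not part of the original script.

```python
import random


def make_line(rng: random.Random) -> bytes:
    """Build one newline-terminated line of 5 to 15 random integers (1 to 20), encoded as UTF-8."""
    count = rng.randint(5, 15)
    numbers = [str(rng.randint(1, 20)) for _ in range(count)]
    return (",".join(numbers) + "\n").encode("utf-8")


payload = make_line(random.Random(0))
print(type(payload).__name__)           # the payload is a bytes object, ready for send()
print(payload.decode("utf-8").strip())  # the same data decoded back to text
```

The trailing newline matters because the Spark side reads the stream line by line.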

https://github.com/a8131357/wordcount.git

import random
import socket
from time import sleep
host = 'localhost'
port = 9999
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.bind((host, port))
s.listen(1)
print('\nListening for a client at',host , port)
conn, addr = s.accept()
print('\nConnected by', addr)

count = 0
try:
    while True:
        # Payload for the word count: 5 to 15 random integers between 1 and 20
        line = [str(random.randint(1, 20)) for _ in range(random.randint(5, 15))]
        line = ",".join(line) + "\n"

        # A socket can only send bytes, so encode the string (b'aaaaa' would also work)
        conn.sendall(line.encode('utf-8'))
        sleep(2)
        count += 1

        if count == 20:
            # Send one line that contains letters, then leave the loop
            conn.sendall("a,b,c,d,e\n".encode('utf-8'))
            break

except socket.error:
    print('Error occurred.\n\nClient disconnected.\n')
finally:
    # Close the client connection and the listening socket exactly once
    conn.close()
    s.close()
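The sender can be checked without Spark by connecting a plain client to it. The following is a rough loopback sketch under a few assumptions: it binds to port 0 so the OS picks a free port (the script above uses 9999), it sends a fixed two-line payload instead of random numbers, and `serve_once` is a helper name invented for this example.

```python
import socket
import threading


def serve_once(server: socket.socket, payload: bytes) -> None:
    """Accept one client, send the payload, and close the connection."""
    conn, _ = server.accept()
    with conn:
        conn.sendall(payload)


server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("127.0.0.1", 0))  # port 0: let the OS choose a free port
server.listen(1)
port = server.getsockname()[1]

worker = threading.Thread(target=serve_once, args=(server, b"1,2,3\na,b,c,d,e\n"))
worker.start()

# The client reads until the server closes the connection, then splits into lines
with socket.create_connection(("127.0.0.1", port)) as client:
    received = client.makefile("r", encoding="utf-8").read().splitlines()

worker.join()
server.close()
print(received)  # ['1,2,3', 'a,b,c,d,e']
```

Each newline-terminated chunk arrives as one line, which is the unit the streaming app splits on ",".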

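The wordCount file

If the streaming app finds a problem in the input data, it stops and reports an error. This is implemented with one of the following calls: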

ssc.awaitTerminationOrTimeout(30)
ssc.awaitTermination()

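Both calls block while the streaming app runs. There are two ways out. The first is the timeout passed to awaitTerminationOrTimeout: it is set to 30 seconds here, after which the call returns and the app is stopped.

The second is an exception thrown inside the streaming app, which is handled with try/except.

ssc.stop(): by default this shuts down both the SparkContext and the StreamingContext. Passing stopSparkContext=False keeps the SparkContext alive, so a new StreamingContext can be created on it quickly. stopGraceFully=True waits until the data already received has been processed before shutting the streaming app down, rather than interrupting it.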
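The wait-then-stop pattern described above can be wrapped in a small helper. This is only a sketch: `run_until_timeout` is a hypothetical name, and `FakeContext` is a stand-in written for this example so the snippet runs without Spark. With a real StreamingContext the error raised would come from Spark, not the ValueError simulated here.

```python
def run_until_timeout(ssc, timeout_seconds):
    """Wait for the app to end or time out, then stop it but keep the SparkContext.

    Returns True if the wait finished without an error, False if the app raised.
    """
    ok = True
    try:
        ssc.awaitTerminationOrTimeout(timeout_seconds)
    except Exception as err:
        print("input wrong:", err)
        ok = False
    finally:
        # Runs on both exits: timeout reached or exception raised
        ssc.stop(stopSparkContext=False, stopGraceFully=True)
    return ok


class FakeContext:
    """Stand-in for StreamingContext (assumption), used only to exercise the helper."""

    def __init__(self, fail):
        self.fail = fail
        self.stopped_with = None

    def awaitTerminationOrTimeout(self, timeout):
        if self.fail:
            raise ValueError("simulated bad input")
        return False

    def stop(self, stopSparkContext=True, stopGraceFully=False):
        self.stopped_with = (stopSparkContext, stopGraceFully)


print(run_until_timeout(FakeContext(fail=False), 30))
print(run_until_timeout(FakeContext(fail=True), 30))
```

Putting the stop call in `finally` means it is written once and still runs on both exit paths.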

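from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Reuse the shell's SparkContext if one exists. A local run needs at least two
# threads: one for the socket receiver and one for processing.
sc = SparkContext.getOrCreate()
ssc = StreamingContext(sc, 5)  # 5-second batches

lines = ssc.socketTextStream("localhost", 9999)

# int(x) raises on non-numeric input, which makes the batch fail
words = lines.flatMap(lambda line: line.split(",")).map(lambda x: int(x))
pairs = words.map(lambda word: (word, 1))

# Count each word in each batch
wordCounts = pairs.reduceByKey(lambda x, y: x + y)

# Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.pprint()

ssc.start() # Start the computation

try:
    # Wait up to 30 seconds, then stop but keep the SparkContext
    ssc.awaitTerminationOrTimeout(30)
    ssc.stop(stopSparkContext=False, stopGraceFully=True)
except Exception:
    # An error raised inside the streaming app surfaces here
    print('input wrong')
    ssc.stop(stopSparkContext=False, stopGraceFully=True)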

Result
