Hello! These are my notes from learning PySpark.
Spark Streaming
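Spark Streaming uses Spark Core's fast scheduling to analyze streaming data. It ingests data in micro-batches and applies RDD transformations to each batch.
This design lets application code written for batch analytics be reused for stream analytics, which makes it easier to build a real-time big data processing architecture. The cost of this convenience is latency: a record cannot be processed until its micro-batch has been collected.
Other stream processing engines, such as Storm and the streaming component of Flink, process streams event by event rather than in micro-batches. Spark Streaming supports receiving data from Kafka, Flume, Twitter, ZeroMQ, Kinesis, and TCP/IP sockets.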
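To make the latency trade-off concrete, here is a minimal pure-Python sketch of micro-batching. It does not use Spark; the function names `micro_batches` and `worst_case_wait` and the sample events are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into batches of batch_interval seconds."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_interval)].append(value)
    return [batches[index] for index in sorted(batches)]


def worst_case_wait(events, batch_interval):
    """Longest time any event waits before its batch closes and can be processed."""
    return max(
        (int(timestamp // batch_interval) + 1) * batch_interval - timestamp
        for timestamp, _ in events
    )


# Hypothetical events: (arrival time in seconds, word)
events = [(1, "a"), (4, "b"), (9, "a"), (12, "c"), (19, "b")]

print(micro_batches(events, 10))    # [['a', 'b', 'a'], ['c', 'b']]
print(worst_case_wait(events, 10))  # 9
```

With a 10-second interval, the event that arrives at second 1 waits 9 seconds before its batch is processed. An event-at-a-time engine would not impose this wait.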
Stateless transformations
# -*- coding: utf-8 -*-
# @Time : 2020/7/3 10:33
# @Author : ljk
# @Email : ljk13572@163.com
# Spark Streaming word count over a local file stream -> stateless
# https://blog.csdn.net/weixin_43931941/article/details/105386131
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
# Create the streaming environment
sc = SparkContext('local','ljk_streaming')
# Set the batch interval to 10 seconds
ssc = StreamingContext(sc,10)
# Read the file stream
fileDStream = ssc.textFileStream('data/')
# For each batch RDD, run an action (collect) and print the result
fileDStream.foreachRDD(lambda rdd:print(rdd.collect()))
# Print to the console
fileDStream.pprint()
# Transform the stream
result = fileDStream.flatMap(lambda line:line.split(',')).map(lambda word:(word, 1)).reduceByKey(lambda a,b:a+b)
result.pprint()
# Start streaming
ssc.start()
ssc.awaitTermination()
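"Stateless" here means that each batch is counted on its own and nothing carries over to the next batch. The following is a minimal pure-Python sketch of that behavior, without Spark; `count_batch` and the two sample batches are hypothetical names and data used only for illustration.

```python
from collections import Counter


def count_batch(lines):
    """Count comma-separated words in one batch; nothing is kept between calls."""
    return Counter(word for line in lines for word in line.split(","))


# Two hypothetical batches, as if two files arrived in different intervals
batch_1 = ["a,b", "a,a"]
batch_2 = ["b,c"]

print(dict(count_batch(batch_1)))  # {'a': 3, 'b': 1}
print(dict(count_batch(batch_2)))  # {'b': 1, 'c': 1}
```

The count for `b` in the second batch is 1, not 2, because the first batch's result is not remembered.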
Stateful transformations
# -*- coding: utf-8 -*-
# @Time : 2020/7/3 11:27
# @Author : ljk
# @Email : ljk13572@163.com
# Spark Streaming word count over a local file stream -> stateful
# https://blog.csdn.net/a8131357leo/article/details/101006510
def updateFunction(newValues, runningCount):
    # For a key that has no state yet, the previous value is None
    if runningCount is None:
        runningCount = 0
    return sum(newValues, runningCount)
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
sc = SparkContext('local','ljk_streaming')
ssc = StreamingContext(sc, 10)
ssc.checkpoint("checkpoint")
lines = ssc.textFileStream('data/')
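# Build an initial RDD of (key, 0) pairs to hold the counts; an initial value is not actually required
initial_dic = sc.parallelize(range(1, 5)).map(lambda x: (x, 0))
# textFileStream already yields one record per line, so each whole line is counted as one "word"
words = lines.flatMap(lambda line: line.split("\n")).map(lambda word: (word, 1))
# reduceByKey counts within each batch RDD; updateStateByKey then folds those counts into the running state
wordCounts = words.reduceByKey(lambda x, y: x + y).updateStateByKey(updateFunction)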
wordCounts.pprint()
# Start streaming
ssc.start()
ssc.awaitTermination()
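The way `updateStateByKey` carries counts from one batch to the next can be sketched in plain Python. This is only an approximation of the semantics under simplifying assumptions, not Spark's implementation; `apply_batch` and the sample batches are hypothetical, and `update_function` mirrors the `updateFunction` above.

```python
def update_function(new_values, running_count):
    """Same logic as updateFunction above: add the new values to the old count."""
    if running_count is None:
        running_count = 0
    return sum(new_values, running_count)


def apply_batch(state, batch_counts):
    """Fold one batch of per-key counts into the running state (hypothetical helper)."""
    new_state = {}
    # Keys already in the state are visited even when the batch has no new values
    for key in set(state) | set(batch_counts):
        new_values = [batch_counts[key]] if key in batch_counts else []
        new_state[key] = update_function(new_values, state.get(key))
    return new_state


state = {}
for batch_counts in [{"a": 3, "b": 1}, {"b": 1, "c": 1}]:
    state = apply_batch(state, batch_counts)
    print(sorted(state.items()))
# [('a', 3), ('b', 1)]
# [('a', 3), ('b', 2), ('c', 1)]
```

Unlike the stateless version, `b` reaches 2 after the second batch, and `a` keeps its count even though it did not appear again.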
Socket data streams
Producing data
# -*- coding: utf-8 -*-
# @Time : 2020/7/3 14:53
# @Author : ljk
# @Email : ljk13572@163.com
import random
import socket
from time import sleep
host = 'localhost'
port = 9999
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.bind((host, port))
s.listen(1)
print('\nListening for a client at', host, port)
conn, addr = s.accept()
print('\nConnected by', addr)
count = 0
lists = ['a','b']
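try:
    while True:
        # Line to send for the word count: two letters picked at random from lists
        line = [lists[random.randint(0, 1)] for x in range(2)]
        line = ",".join(line) + "\n"
        print(line)
        # A socket can only send bytes, so encode the string (a bytes literal such as b'aaaaa' also works)
        conn.send(line.encode('utf-8'))
        sleep(2)
        count += 1
        if count == 20:
            # Send a text line that contains more letters
            # conn.send("a,b,c,d,e".encode('utf-8'))
            conn.close()
            # Stop the loop so we do not send on a closed connection
            break
except socket.error:
    print('Error occurred.\n\nClient disconnected.\n')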
# s.shutdown(socket.SHUT_RDWR)
# s.close()
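The producer ends every line with "\n" because `socketTextStream` interprets the received bytes as UTF-8 text delimited by newlines. As a way to check the line format without starting Spark, here is a minimal sketch of a plain Python reader. It uses `socket.socketpair()` in place of the real `localhost:9999` connection so that it is self-contained; `read_lines` is a hypothetical helper, not part of the original scripts.

```python
import socket


def read_lines(conn):
    """Read until the peer closes, then split the UTF-8 text on newlines."""
    chunks = []
    while True:
        data = conn.recv(1024)
        if not data:  # an empty result means the peer closed the connection
            break
        chunks.append(data)
    return b"".join(chunks).decode("utf-8").splitlines()


# socketpair() returns two already-connected sockets in the same process
producer, consumer = socket.socketpair()
for line in ["a,b", "b,a"]:
    producer.sendall((line + "\n").encode("utf-8"))
producer.close()

lines = read_lines(consumer)
consumer.close()
print(lines)  # ['a,b', 'b,a']
```

Each recovered line can then be split on "," in the same way as the `flatMap` step in the Spark job.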
Processing the data with PySpark Streaming
# -*- coding: utf-8 -*-
# @Time : 2020/7/3 14:52
# @Author : ljk
# @Email : ljk13572@163.com
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
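# Use at least 2 local threads: the socket receiver occupies one, and batch processing needs another
sc = SparkContext('local[2]','ljk_streaming')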
ssc = StreamingContext(sc, 5)
lines = ssc.socketTextStream("localhost", 9999)
words = lines.flatMap(lambda line: line.split(","))
pairs = words.map(lambda word: (word, 1))
# Count each word in each batch
wordCounts = pairs.reduceByKey(lambda x, y: x + y)
# Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.pprint()
ssc.start() # Start the computation
try:
    ssc.awaitTerminationOrTimeout(30)
    ssc.stop(stopSparkContext=False, stopGraceFully=True)
except Exception:
    print('input wrong')
    ssc.stop(stopSparkContext=False, stopGraceFully=True)