Getting Started with PySpark

_PREPER_MAN

Article link: https://blog.csdn.net/weixin_39633943/article/details/107164749

Hello! These are my notes from learning PySpark.

Spark Streaming

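Spark Streaming uses Spark Core's fast scheduling to analyze streaming data. It ingests data in micro-batches and applies RDD transformations to each batch.
This design lets application code written for batch analytics be reused for streaming analytics, which makes a real-time big data processing architecture easier to build. The price of that convenience is latency: a record is not processed until the micro-batch it belongs to has been collected.
Other stream processing engines, such as Storm and the streaming component of Flink, process data event by event rather than in micro-batches. Spark Streaming can receive data from Kafka, Flume, Twitter, ZeroMQ, Kinesis, and TCP/IP sockets.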
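To make the micro-batch model concrete, here is a minimal plain-Python sketch that needs no Spark installation. The helper names `micro_batches` and `word_count`, the sample events, and the 10-second interval are all illustrative assumptions and not part of any Spark API. The sketch only mimics how records arriving within the same batch interval are grouped and then processed together.

```python
from collections import Counter, defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp, line) events into batches of `batch_interval` seconds.

    Hypothetical helper for illustration only; Spark does this grouping internally.
    """
    batches = defaultdict(list)
    for timestamp, line in events:
        # Integer division assigns every event to the interval it arrived in.
        batches[int(timestamp // batch_interval)].append(line)
    # Emit every interval up to the last one, including empty ones.
    last = max(batches) if batches else -1
    return [batches.get(index, []) for index in range(last + 1)]


def word_count(lines):
    """Count comma-separated words in one batch, like a per-batch RDD job."""
    return Counter(word for line in lines for word in line.split(","))


# Made-up events: (arrival time in seconds, text line).
events = [(1, "a,b"), (4, "a,a"), (12, "b,b"), (19, "a,b"), (25, "b")]

for index, batch in enumerate(micro_batches(events, batch_interval=10)):
    print(index, dict(word_count(batch)))
```

In this sketch the event that arrives at second 1 is only counted once its batch closes at second 10, which is the latency cost described above.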

Stateless transformations

# -*- coding: utf-8 -*-
# @Time    : 2020/7/3 10:33
# @Author  : ljk
# @Email   : ljk13572@163.com

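# Spark Streaming word count over a local file stream -> stateless

# https://blog.csdn.net/weixin_43931941/article/details/105386131
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Create the streaming environment

sc = SparkContext('local', 'ljk_streaming')

# Set the batch interval to 10 seconds (this is the batch size, not a sliding window)
ssc = StreamingContext(sc, 10)

# Read the file stream: new files that appear in data/ are picked up
fileDStream = ssc.textFileStream('data/')

# Run an action on the RDD of each batch: collect its lines and print them
fileDStream.foreachRDD(lambda rdd: print(rdd.collect()))
# Print the first elements of each batch to the console
fileDStream.pprint()

# Transform the stream: split each line on commas and count the words in each batch
result = fileDStream.flatMap(lambda line: line.split(',')).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
result.pprint()
# Start the streaming computation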
ssc.start()
ssc.awaitTermination()
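`textFileStream` only reacts to files that appear in the watched directory while the job is running, and a file that is still being written may be read half-finished. A common way to feed such a job is to write the file somewhere else first and then move it into the directory in a single step. The sketch below shows one way to do that. The helper name `drop_file` and the sibling `data_staging` directory are assumptions made for illustration, not something the job above requires.

```python
import os
import uuid


def drop_file(lines, target_dir="data", staging_dir=None):
    """Write `lines` to a staging file, then move it into `target_dir` in one step.

    Hypothetical helper; `target_dir` defaults to the 'data' directory watched above.
    """
    os.makedirs(target_dir, exist_ok=True)
    # Stage next to the watched directory (assumed to be on the same filesystem)
    # so that the final move is a plain rename and never a slow copy.
    if staging_dir is None:
        staging_dir = os.path.abspath(target_dir) + "_staging"
    os.makedirs(staging_dir, exist_ok=True)

    name = "words-%s.txt" % uuid.uuid4().hex
    staged_path = os.path.join(staging_dir, name)
    with open(staged_path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")

    final_path = os.path.join(target_dir, name)
    # os.replace renames atomically when both paths are on the same filesystem.
    os.replace(staged_path, final_path)
    return final_path
```

With the streaming job running, calling `drop_file(["a,b,a", "b,c"])` from a second Python session should make the words show up in the next batch.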

Stateful transformations

# -*- coding: utf-8 -*-
# @Time    : 2020/7/3 11:27
# @Author  : ljk
# @Email   : ljk13572@163.com

# Spark Streaming word count over a local file stream -> stateful
# https://blog.csdn.net/a8131357leo/article/details/101006510

from pyspark import SparkContext
from pyspark.streaming import StreamingContext


def updateFunction(newValues, runningCount):
    # For a key that has no state yet, runningCount is None
    if runningCount is None:
        runningCount = 0
    return sum(newValues, runningCount)


sc = SparkContext('local', 'ljk_streaming')
ssc = StreamingContext(sc, 10)
# updateStateByKey needs a checkpoint directory to persist the state
ssc.checkpoint("checkpoint")

lines = ssc.textFileStream('data/')

# Build an initial RDD that could seed the counts; it is not actually needed and is not passed on below
initial_dic = sc.parallelize(range(1, 5)).map(lambda x: (x, 0))

# textFileStream already yields one record per line, so split on commas to get the words
words = lines.flatMap(lambda line: line.split(",")).map(lambda word: (word, 1))


# reduceByKey counts within each batch's RDD; updateStateByKey then adds those counts to the running totals
wordCounts = words.reduceByKey(lambda x, y: x + y).updateStateByKey(updateFunction)

wordCounts.pprint()
# Start the streaming computation
ssc.start()
ssc.awaitTermination()
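The way state is carried from one batch to the next can be traced without Spark. The following plain-Python sketch replays the update logic over three hand-made batches. `simulate_update_state_by_key` is a hypothetical helper written for this illustration, and the batch contents are invented. It follows the documented behaviour of `updateStateByKey`: the update function runs for every key that has new values or existing state, and a `None` result drops the key.

```python
def update_function(new_values, running_count):
    # Same logic as updateFunction above: a key without state starts from 0.
    if running_count is None:
        running_count = 0
    return sum(new_values, running_count)


def simulate_update_state_by_key(batches, update):
    """Apply `update` batch by batch, mimicking how state is carried forward.

    Hypothetical helper for illustration; this is not a Spark API.
    """
    state = {}
    history = []
    for batch in batches:
        # Collect the new values per key for this batch.
        new_values = {}
        for key, value in batch:
            new_values.setdefault(key, []).append(value)
        # Keys with new data or with existing state are all updated.
        for key in set(state) | set(new_values):
            result = update(new_values.get(key, []), state.get(key))
            if result is None:
                state.pop(key, None)  # returning None removes the key
            else:
                state[key] = result
        history.append(dict(state))
    return history


# Each inner list stands for the per-batch counts produced by reduceByKey.
batches = [
    [("a", 2), ("b", 1)],
    [("a", 1)],
    [("b", 3), ("c", 1)],
]
for snapshot in simulate_update_state_by_key(batches, update_function):
    print(sorted(snapshot.items()))
```

The second snapshot still contains `b` even though the second batch has no `b`, which is the difference from the stateless version, where every batch starts counting from zero.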

Socket data stream

Producing data

# -*- coding: utf-8 -*-
# @Time    : 2020/7/3 14:53
# @Author  : ljk
# @Email   : ljk13572@163.com


import random
import socket
from time import sleep

host = 'localhost'
port = 9999
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.bind((host, port))
s.listen(1)
print('\nListening for a client at', host, port)
conn, addr = s.accept()
print('\nConnected by', addr)

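count = 0
lists = ['a', 'b']
try:
    while True:

        # Build one line for the word count: two letters picked at random from `lists`

        line = [lists[random.randint(0, 1)] for x in range(2)]

        line = ",".join(line) + "\n"
        print(line, end='')
        # Sockets only carry bytes, so encode the string first (a bytes literal such as b'aaaaa' also works)
        conn.sendall(line.encode('utf-8'))
        sleep(2)
        count += 1

        if count == 20:
            # Optionally send one more line containing other letters before stopping
            # conn.send("a,b,c,d,e".encode('utf-8'))
            break

except socket.error:
    print('Error occurred.\n\nClient disconnected.\n')

finally:
    # Release both sockets whether the loop finished or the client dropped
    conn.close()
    s.close()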
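The producer ends every message with "\n" because `socketTextStream` interprets the incoming bytes as newline-delimited UTF-8 text. The sketch below illustrates why the delimiter matters, using `socket.socketpair()` as a stand-in for the producer and consumer connection. The helper `read_lines` and the sample payloads are assumptions made for this illustration only.

```python
import socket


def read_lines(conn):
    """Yield newline-delimited UTF-8 lines from a connected socket until it closes."""
    with conn.makefile("r", encoding="utf-8", newline="\n") as reader:
        for line in reader:
            yield line.rstrip("\n")


# socketpair() returns two already-connected sockets, standing in for
# the producer side and the consumer side of the connection.
producer, consumer = socket.socketpair()

# A stream socket carries bytes, not messages: the two sends below do not
# line up with the text lines, so only "\n" marks where each line ends.
producer.sendall("a,b\nb,".encode("utf-8"))
producer.sendall("a\na,a\n".encode("utf-8"))
producer.close()  # closing signals end-of-stream to the reader

received = list(read_lines(consumer))
consumer.close()
print(received)
```

The reader recovers three lines even though the second line was split across two sends.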

Processing the data with PySpark Streaming

# -*- coding: utf-8 -*-
# @Time    : 2020/7/3 14:52
# @Author  : ljk
# @Email   : ljk13572@163.com

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
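# Use at least two local threads: the socket receiver occupies one, so plain 'local' leaves none for processing
sc = SparkContext('local[2]', 'ljk_streaming')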
ssc = StreamingContext(sc, 5)

lines = ssc.socketTextStream("localhost", 9999)

words = lines.flatMap(lambda line: line.split(","))
pairs = words.map(lambda word: (word, 1))

# Count each word in each batch

wordCounts = pairs.reduceByKey(lambda x, y: x + y)

# Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.pprint()

ssc.start()  # Start the computation

try:
    # Let the job run for at most 30 seconds
    ssc.awaitTerminationOrTimeout(30)
except Exception as e:
    print('Streaming job failed:', e)
finally:
    # Stop the streaming context gracefully but keep the SparkContext alive
    ssc.stop(stopSparkContext=False, stopGraceFully=True)