This series covers:
- Setting up and testing a Kafka environment
- Python producer/consumer tests
- Spark consuming Kafka messages, processing them, and writing the results back to Kafka
- Bringing a Kafka consumer into Flask
- Real-time display over WebSocket
Versions:
spark-2.4.3-bin-hadoop2.7.tgz
kafka_2.11-2.1.0.tgz
------------------ Part 3: Spark consumes Kafka messages, processes them, and writes back to Kafka --------------------
import sys
from pyspark.sql.types import IntegerType
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, concat, array_join, concat_ws
from pyspark.sql.functions import split
from pyspark.sql.functions import window
if __name__ == "__main__":
    # Kafka broker address
    bootstrapServers = "192.168.147.128:9092"
    # subscription type
    subscribeType = "subscribe"
    # topic to subscribe to
    topics = "bigdata"
    # window size: 30 seconds
    windowSize = 30
    # slide interval: 15 seconds
    slideSize = 15
    windowDuration = '{} seconds'.format(windowSize)
    slideDuration = '{} seconds'.format(slideSize)

    spark = SparkSession.builder.appName("KafkaWordCount").getOrCreate()

    # Read the stream into a DataFrame.
    # Data received from Kafka lands in the "value" column; the DataFrame
    # also carries key, topic, partition, offset, timestamp and
    # timestampType columns as metadata.
    # "value" is a binary byte array and must be cast to a string before use.
    lines = spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", bootstrapServers) \
        .option(subscribeType, topics) \
        .load() \
        .selectExpr("CAST(value AS STRING)", "timestamp")