Spark Streaming (Python): read data from Kafka and write the results to a single specified local file

This post walks through a Python Spark Streaming job that reads data from Kafka, parses each message, and enriches it with geographic location information. After processing, the results are written to a specified local file. It shows how to create a DStream with KafkaUtils and how custom functions are used to extract and transform the data.
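Judging from the parsing code below, each Kafka record's value is expected to contain one or more newline-separated lines, each of which eval()s to a dict with a "message" field; that field in turn holds eight "\x01"-separated key\x02value pairs, with the client IP address and the source as the last two. A minimal sketch of what one such record value might look like (the field names and values here are invented for illustration only, not taken from the original data):

fields = ["uid\x02u-001", "device\x02dev-01", "ts\x021488888888",
          "event\x02asr", "text\x02hello", "duration\x02350",
          "ip\x02203.0.113.7", "source\x02android"]
sample_message = "\x01".join(fields)             # eight key\x02value fields
sample_value = str({"message": sample_message})  # one line of the Kafka record value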

#!/bin/env python3
# -*- coding: UTF-8 -*-
# filename: readFromKafkaStreamingGetLocation.py

import datetime

import IP  # IP-to-location lookup exposing IP.find() (e.g. the 17MonIP / ipip.net library)
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils


class KafkaMessageParse:

    def extractFromKafka(self, kafkainfo):
        # Each Kafka record arrives as a (key, value) tuple; keep only the value.
        if type(kafkainfo) is tuple and len(kafkainfo) == 2:
            return kafkainfo[1]

    def lineFromLines(self, lines):
        # Split a multi-line record value into individual lines.
        if lines is not None and len(lines) > 0:
            return lines.strip().split("\n")

    def messageFromLine(self, line):
        # Pull the "message" field out of the dict parsed from one line.
        if line is not None and "message" in line.keys():
            return line.get("message")

    def ip2location(self, ip):
        # Resolve an IP address to [ip, country, province, city].
        result = []
        country = 'country'
        province = 'province'
        city = 'city'
        ipinfo = IP.find(ip.strip())
        try:
            location = ipinfo.split("\t")
            if len(location) == 3:
                country = location[0]
                province = location[1]
                city = location[2]
            elif len(location) == 2:
                country = location[0]
                province = location[1]
            else:
                pass
        except Exception:
            pass
        result.append(ip)
        result.append(country)
        result.append(province)
        result.append(city)
        return result

    def vlistfromkv(self, strori, sep1, sep2):
        # Split "k<sep2>v<sep1>k<sep2>v..." and return only the values.
        resultlist = []
        fields = strori.split(sep1)
        for field in fields:
            kv = field.split(sep2)
            resultlist.append(kv[1])
        return resultlist

    def extractFromMessage(self, message):
        # A valid message has 8 "\x01"-separated key\x02value fields, the last two
        # being ip and source; replace ip with its location fields and re-join with "\x01".
        if message is not None and len(message) > 1:
            if len(message.split("\x01")) == 8:
                resultlist = self.vlistfromkv(message, "\x01", "\x02")
                source = resultlist.pop()
                ip = resultlist.pop()
                resultlist.extend(self.ip2location(ip))
                resultlist.append(source)
                result = "\x01".join(resultlist)
                return result


def tpprint(val, num=10000):
    """
    Print the first num elements of each RDD generated in this DStream
    and append them to a date-stamped local file on the driver.

    @param num: the number of elements from the first will be printed.
    """
    def takeAndPrint(time, rdd):
        taken = rdd.take(num + 1)
        print("########################")
        print("Time: %s" % time)
        print("########################")
        DATEFORMAT = '%Y%m%d'
        today = datetime.datetime.now().strftime(DATEFORMAT)
        myfile = open("/data/speech/speech." + today, "a")
        for record in taken[:num]:
            print(record)
            myfile.write(str(record) + "\n")
        myfile.close()
        if len(taken) > num:
            print("...")
        print("")

    val.foreachRDD(takeAndPrint)


if __name__ == '__main__':
    zkQuorum = 'datacollect-1:2181,datacollect-2:2181,datacollect-3:2181'
    topic = {'speech-1': 1, 'speech-2': 1, 'speech-3': 1, 'speech-4': 1, 'speech-5': 1}
    groupid = "rokid-speech-get-location"
    master = "local[*]"
    appName = "SparkStreamingRokid"
    timecell = 5  # batch interval in seconds

    sc = SparkContext(master=master, appName=appName)
    ssc = StreamingContext(sc, timecell)
    # ssc.checkpoint("checkpoint_" + time.strftime("%Y-%m-%d", time.localtime(time.time())))

    kvs = KafkaUtils.createStream(ssc, zkQuorum, groupid, topic)
    kmp = KafkaMessageParse()
    lines = kvs.map(lambda x: kmp.extractFromKafka(x))
    lines1 = lines.flatMap(lambda x: kmp.lineFromLines(x))
    valuedict = lines1.map(lambda x: eval(x))
    message = valuedict.map(lambda x: kmp.messageFromLine(x))
    rdd2 = message.map(lambda x: kmp.extractFromMessage(x))
    # rdd2.pprint()

    tpprint(rdd2)
    # rdd2.fileprint(filepath="result.txt")
    # rdd2.foreachRDD().saveAsTextFiles("/home/admin/agent/spark/result.txt")
    # sc.parallelize(rdd2.cache()).saveAsTextFile("/home/admin/agent/spark/result", "txt")
    # rdd2.repartition(1).saveAsTextFiles("/home/admin/agent/spark/result.txt")

    ssc.start()
    ssc.awaitTermination()
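A couple of notes on running this. KafkaUtils.createStream is the receiver-based Kafka 0.8 API, so the job needs the matching spark-streaming-kafka assembly on the classpath when submitted, roughly along the lines of spark-submit --jars spark-streaming-kafka-assembly_<scala>-<spark-version>.jar readFromKafkaStreamingGetLocation.py (the exact artifact name depends on the Spark and Scala versions in use). The single-file output comes from tpprint: it pulls each batch back to the driver with rdd.take() and appends to /data/speech/speech.<yyyymmdd>, so a whole day's results land in one local file. The commented-out saveAsTextFiles variants cannot do that directly, since they write a separate directory of part files for every batch interval.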
