A simple introduction to reading Kafka from PySpark

1. Environment setup

Spark runs from a Docker image, pulled with docker pull bde2020/spark-master (see that image's documentation for details). Kafka was installed following online tutorials; I covered that in an earlier post, so I won't repeat it here.
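For reference, a minimal sketch of pulling and starting the master container. The image tag, ENABLE_INIT_DAEMON flag, and port mappings are assumptions based on the bde2020 image conventions; check the image's documentation for the tag matching your Spark version:

# Pull and start a standalone Spark master (tag is an assumption; pick one
# matching your Spark version from the bde2020 repository)
docker pull bde2020/spark-master:2.3.1-hadoop2.7
docker run -d --name spark-master -h spark-master \
    -e ENABLE_INIT_DAEMON=false \
    -p 7077:7077 -p 8080:8080 \
    bde2020/spark-master:2.3.1-hadoop2.7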

2. Sample code

#encoding=utf8
import json

from pyspark import SparkConf
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

offsets = []

def store_offset(rdd):
    # Capture the Kafka offset ranges of each batch; offsetRanges() is only
    # available on the RDDs produced directly by the direct stream, hence
    # the transform() below.
    global offsets
    offsets = rdd.offsetRanges()
    return rdd

def print_offset(rdd):
    for o in offsets:
        print("%s %s %s %s %s" % (o.topic, o.partition, o.fromOffset,
                                  o.untilOffset, o.untilOffset - o.fromOffset))

conf = SparkConf()
scontext = SparkContext(appName='kafka_pyspark_test', conf=conf)
stream_context = StreamingContext(scontext, 2)  # 2-second batches

# Direct (receiver-less) stream; each message arrives as a (key, value) tuple.
msg_stream = KafkaUtils.createDirectStream(
    stream_context,
    ['asin_bsr_result'],
    kafkaParams={"metadata.broker.list": "3.120.9.44:9092"})

# Parse each message body as JSON and keep only its keys.
result = msg_stream.map(lambda x: list(json.loads(x[1]).keys()))
msg_stream.transform(store_offset).foreachRDD(print_offset)
result.pprint()

stream_context.start()
stream_context.awaitTermination()
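To give the stream something to consume during testing, a small producer can push JSON onto the topic. This is a minimal sketch using the kafka-python package (installing kafka-python is an assumption here; it is not required by the streaming job itself, and the broker address and topic come from the code above):

#encoding=utf8
# Minimal test producer (assumes `pip install kafka-python`).
import json
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="3.120.9.44:9092")
# Values must be bytes when no value_serializer is configured.
producer.send('asin_bsr_result',
              json.dumps({"asin": "B000TEST", "rank": 1}).encode('utf8'))
producer.flush()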


3. Submitting and running

Change into the spark/bin directory and submit the job:

./spark-submit --master spark://0.0.0.0:7077 test.py

This first attempt fails with an error about a missing jar.

Download spark-streaming-kafka-0-8-assembly, version 2.3.1 (matching the Spark version), from http://search.maven.org/ into the spark/jars directory and the job runs normally. Note that the master address at submit time must be 0.0.0.0; both localhost and 127.0.0.1 fail with a "server not found" error.
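The assembly jar can also be fetched straight from Maven Central; the URL below follows the standard repository layout and is an assumption, so verify it against search.maven.org before relying on it:

# Fetch the Kafka 0.8 assembly jar into spark/jars
# (URL assumed from the standard Maven Central layout)
cd /spark/jars
wget https://repo1.maven.org/maven2/org/apache/spark/spark-streaming-kafka-0-8-assembly_2.11/2.3.1/spark-streaming-kafka-0-8-assembly_2.11-2.3.1.jar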

The next run failed with java.lang.ClassNotFoundException: org.apache.spark.streaming.kafka.KafkaRDDPartition, so the jar has to be supplied explicitly. I downloaded spark-streaming-kafka-0-8_2.11-2.3.1.jar into the jars directory and ran:

./spark-submit --packages --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.1 \
 --master spark://0.0.0.0:7077 \
 test.py

But this produced a missing-main-class error: Cannot load main class from JAR org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.1 with URI org.apache.spark. Please specify a class through --class. The cause is the duplicated --packages flag: the first one consumes the literal string "--packages" as its value, leaving the Maven coordinates to be parsed as the application jar.
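For comparison, the form that was presumably intended, with --packages given once so that the coordinates are resolved from Maven rather than mistaken for the application jar:

./spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.1 \
 --master spark://0.0.0.0:7077 \
 test.py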

Switching to --class plus --jars:

./spark-submit --class sparkstreaming.KafkaStreaming \
 --jars ../jars/spark-streaming-kafka-0-8_2.11-2.3.1.jar \
 --master spark://0.0.0.0:7077 \
 test.py

fails with java.lang.NoClassDefFoundError: kafka/common/TopicAndPartition, because the plain (non-assembly) jar does not bundle the Kafka client classes. (Note that --class only applies to JVM applications; it has no effect on a Python script.)

Finally, this combination ran successfully:

./spark-submit --class sparkstreaming.KafkaStreaming \
 --jars ../jars/spark-streaming-kafka-0-8_2.11-2.3.1.jar \
 --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.1 \
 --master spark://0.0.0.0:7077 \
 test.py

./spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.1 --jars ../jars/spark-streaming-kafka-0-8-assembly_2.11-2.3.2.jar --class sparkstreaming.KafkaStreaming --master spark://0.0.0.0:7077 kafka_test.py

After rebuilding the container I found that only the command above works: the --jars argument must point at an assembly jar (spark-streaming-kafka-0-8-assembly_2.11-2.3.2.jar), which bundles the Kafka client classes. A painful lesson!

#encoding=utf8
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

def to_redis(rdd):
    # Open the Redis connection inside each partition: connection objects are
    # not serializable, so one created on the driver cannot be shipped to
    # the workers.
    def write_partition(records):
        import redis
        rclient = redis.Redis(host="172.17.0.7", port=6379)
        for key, value in records:
            rclient.set(key, value)
    rdd.foreachPartition(write_partition)

def out_data(records):
    # `records` is an iterator over the elements of one partition.
    for r in records:
        print(r)

sparkcontext = SparkContext("spark://0.0.0.0:7077", "spark_test01")
# Ship the zipped redis package to the executors so `import redis` works there.
sparkcontext.addPyFile('redis.zip')
ssc = StreamingContext(sparkcontext, 5)  # 5-second batches

# Receiver-based stream via ZooKeeper; messages arrive as (key, value) tuples.
kafka_stream = KafkaUtils.createStream(
    ssc, "172.17.0.6:2181", "spark_test01", {"flumetest2": 1})

# Fields are separated by "|@|"; take field 0 as the key and field 2 as the
# value, concatenate values per key, then print each batch partition by
# partition and write the results to Redis.
counts = (kafka_stream.map(lambda x: x[1])
          .map(lambda x: (x.split("|@|")[0], x.split("|@|")[2]))
          .reduceByKey(lambda a, b: a + b))
counts.foreachRDD(lambda rdd: rdd.foreachPartition(out_data))
counts.foreachRDD(to_redis)

ssc.start()
ssc.awaitTermination()

This job writes its output through Redis. At runtime it fails with an error that redis.client cannot be found: the redis module under site-packages has to be zipped up and shipped along, and the code must register the zip as a dependency (the addPyFile call above). The final submit command follows; note that --py-files, like every other option, must come before the application script, otherwise spark-submit passes it to the script as an ordinary argument:

./spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.1 --jars ../jars/spark-streaming-kafka-0-8-assembly_2.11-2.3.2.jar --class sparkstreaming.KafkaStreaming --master spark://0.0.0.0:7077 --py-files redis.zip kafka_test.py
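A minimal sketch of building redis.zip from an installed redis package; the site-packages path is an assumption and depends on your Python installation:

# Zip the installed redis package so it can be shipped with
# --py-files / addPyFile (paths are assumptions; adjust to your setup)
cd /usr/lib/python2.7/site-packages
zip -r /path/to/job/redis.zip redis/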
