Fixing the missing spark-streaming-kafka-0-8-assembly.jar error when submitting a PySpark Kafka job
1. Start the Kafka console producer
[root@hadoop102 ~]# kafka-console-producer --broker-list hadoop102:9092 --topic test1
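If the test1 topic does not exist yet, it can be created first. A sketch using the same CDH-style Kafka CLI, assuming ZooKeeper runs on hadoop102:2181 (partition and replication counts are illustrative):
[root@hadoop102 ~]# kafka-topics --zookeeper hadoop102:2181 --create --topic test1 --partitions 1 --replication-factor 1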
2. PySpark consumer script: test2.py
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext('local[*]','myapp')
ssc = StreamingContext(sc,5)
zkQuorum = 'hadoop102:2181'
groupId = 'kafka1'
topics = {'test1':1}
kafkaParams={'auto.offset.reset':'earliest'}  # defined here but never passed to createStream below
# receiver-based stream through ZooKeeper; each record is a (key, value) tuple, keep only the value
dstream=KafkaUtils.createStream(ssc,zkQuorum,groupId,topics).map(lambda t:t[1])
# word count per batch
result = dstream.flatMap(lambda line:line.split(' ')).map(lambda word:(word,1)).reduceByKey(lambda a,b:a+b)
# print each batch to stdout
result.pprint()

# start the streaming job
ssc.start()
ssc.awaitTermination()
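For reference, the same 0-8 connector also exposes a receiver-less "direct" API. Below is a minimal sketch, assuming the broker from step 1 at hadoop102:9092; it needs the same assembly jar on the classpath and fails with the same error without it (the app name and offset setting are illustrative):
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext('local[*]','myapp_direct')
ssc = StreamingContext(sc,5)
# direct (receiver-less) stream: offsets are tracked against the brokers, not ZooKeeper
dstream = KafkaUtils.createDirectStream(
    ssc,
    ['test1'],
    {'metadata.broker.list': 'hadoop102:9092',
     'auto.offset.reset': 'smallest'})  # the 0-8 consumer expects smallest/largest, not earliest/latest
result = dstream.map(lambda t: t[1]) \
    .flatMap(lambda line: line.split(' ')) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
result.pprint()
ssc.start()
ssc.awaitTermination()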
3. The submit fails with the following error:
[root@hadoop102 ~]# spark-submit test2.py
21/07/07 20:18:21 WARN lineage.LineageWriter: Lineage directory /var/log/spark/lineage doesn't exist or is not writable. Lineage for this application will be disabled.
________________________________________________________________________________________________
Spark Streaming's Kafka libraries not found in class path. Try one of the following.
1. Include the Kafka library and its dependencies with in the
spark-submit command as
$ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.4.0-cdh6.2.1 ...
2. Download the JAR of the artifact from Maven Central http://search.maven.org/,
Group Id = org.apache.spark, Artifact Id = spark-streaming-kafka-0-8-assembly, Version = 2.4.0-cdh6.2.1.
Then, include the jar in the spark-submit command as
$ bin/spark-submit --jars <spark-streaming-kafka-0-8-assembly.jar> ...
________________________________________________________________________________________________
Traceback (most recent call last):
File "/root/test2.py", line 13, in <module>
dstream=KafkaUtils.createStream(ssc,zkQuorum,groupId,topics).map(lambda t:t[1])
File "/opt/cloudera/parcels/CDH-6.2.1-1.cdh6.2.1.p0.1425774/lib/spark/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 78, in createStream
File "/opt/cloudera/parcels/CDH-6.2.1-1.cdh6.2.1.p0.1425774/lib/spark/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 217, in _get_helper
TypeError: 'JavaPackage' object is not callable
4. Solution, exactly as the error message suggests (the TypeError appears because the Kafka helper class is not on the JVM classpath, so py4j hands back an unresolved JavaPackage instead of a callable class):
2. Download the JAR of the artifact from Maven Central http://search.maven.org/,
Group Id = org.apache.spark, Artifact Id = spark-streaming-kafka-0-8-assembly, Version = 2.4.0-cdh6.2.1.
Then, include the jar in the spark-submit command as
$ bin/spark-submit --jars <spark-streaming-kafka-0-8-assembly.jar> ...
The message tells you to go to http://search.maven.org/ and download the spark-streaming-kafka-0-8-assembly jar for the matching version, 2.4.0.
Jar download link:
https://download.csdn.net/download/qq_22905163/20068310?spm=1001.2014.3001.5501
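If you would rather pull the assembly straight from Maven Central than from the mirror above, the Scala 2.11 / 2.4.0 build should sit at the standard repository path (assumption: the upstream 2.4.0 build is compatible with CDH's Spark 2.4.0):
https://repo1.maven.org/maven2/org/apache/spark/spark-streaming-kafka-0-8-assembly_2.11/2.4.0/spark-streaming-kafka-0-8-assembly_2.11-2.4.0.jar
Option 1 from the error message can also work when the node has internet access, but the coordinates it prints are missing the Scala suffix and the -cdh6.2.1 version is not published on Maven Central; something closer to this should resolve:
[root@hadoop102 ~]# spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.4.0 test2.py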
5. The jar I downloaded is spark-streaming-kafka-0-8-assembly_2.11-2.4.0.jar
My environment is CDH 6.2.1. I placed spark-streaming-kafka-0-8-assembly_2.11-2.4.0.jar under /root/ and point spark-submit at it explicitly with --jars.
Note: I had previously dropped this jar into Spark's jars/ directory. The test2.py job below ran fine that way, but launching the pyspark shell then failed outright: the SparkSession could not be initialized and it threw an error. That is why the jar now lives in /root/ and is passed with --jars instead.
6. Successful run
[root@hadoop102 ~]# spark-submit --jars spark-streaming-kafka-0-8-assembly_2.11-2.4.0.jar test2.py
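With the assembly on the classpath, typing a few words into the producer console from step 1 prints a word-count batch every 5 seconds, roughly of this shape (illustrative pprint output, not captured from a real run):
-------------------------------------------
Time: 2021-07-07 20:30:05
-------------------------------------------
('hello', 2)
('world', 1)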