Fixing the missing spark-streaming-kafka-0-8-assembly.jar error when submitting a PySpark Kafka job
1. Start the Kafka console producer
[root@hadoop102 ~]# kafka-console-producer --broker-list hadoop102:9092 --topic test1
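If the test1 topic does not exist yet, it can be created first. A sketch using the same CDH-style Kafka CLI, assuming ZooKeeper runs on hadoop102:2181 (partition and replication counts are illustrative):
[root@hadoop102 ~]# kafka-topics --zookeeper hadoop102:2181 --create --topic test1 --partitions 1 --replication-factor 1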
2. PySpark consumer script: test2.py
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext('local[*]','myapp')
ssc = StreamingContext(sc,5)
zkQuorum = 'hadoop102:2181'
groupId = 'kafka1'
topics = {'test1':1}
kafkaParams={'auto.offset.reset':'earliest'}  # defined here but never passed to createStream below
# receiver-based stream through ZooKeeper; each record is a (key, value) tuple, keep only the value
dstream=KafkaUtils.createStream(ssc,zkQuorum,groupId,topics).map(lambda t:t[1])
# word count per batch
result = dstream.flatMap(lambda line:line.split(' ')).map(lambda word:(word,1)).reduceByKey(lambda a,b:a+b)
# print each batch to stdout
result.pprint()

# start the streaming job
ssc.start()
ssc.awaitTermination()
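For reference, the same 0-8 connector also exposes a receiver-less "direct" API. Below is a minimal sketch, assuming the broker from step 1 at hadoop102:9092; it needs the same assembly jar on the classpath and fails with the same error without it (the app name and offset setting are illustrative):
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext('local[*]','myapp_direct')
ssc = StreamingContext(sc,5)
# direct (receiver-less) stream: offsets are tracked against the brokers, not ZooKeeper
dstream = KafkaUtils.createDirectStream(
    ssc,
    ['test1'],
    {'metadata.broker.list': 'hadoop102:9092',
     'auto.offset.reset': 'smallest'})  # the 0-8 consumer expects smallest/largest, not earliest/latest
result = dstream.map(lambda t: t[1]) \
    .flatMap(lambda line: line.split(' ')) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
result.pprint()
ssc.start()
ssc.awaitTermination()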
3. The submit fails with the following error:
[root@hadoop102 ~]# spark-submit test2.py
21/07/07 20:18:21 WARN lineage.LineageWriter: Lineage directory /var/log/spark/lineage doesn't exist or is not writable. Lineage for this application will be disabled.
________________________________________________________________________________________________
Spark Streaming's Kafka libraries not found in class path. Try one of the following.
1. Include the Kafka library and its dependencies with in the
spark-submit command as
$ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.4.0-cdh6.2.1 ...
2. Download the JAR of the artifact from Maven Central http://search.maven.org/,
Group Id = org.apache.spark, Artifact Id = spark-streaming-kafka-0-8-assembly, Version = 2.4.0-cdh6.2.1.
Then, include the jar in the spark-submit command as
$ bin/spark-submit --jars <spark-streaming-kafka-0-8-assembly.jar> ...
________________________________________________________________________________________________
Traceback (most recent call last):
File "/root/test2.py", line 13, in <module>
dstream=KafkaUtils.createStream(ssc,zkQuorum,groupId,topics).map(lambda t:t[1])
File "/opt/cloudera/parcels/CDH-6.2.1-1.cdh6.2.1.p0.1425774/lib/spark/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 78, in createStream
File "/opt/cloudera/parcels/CDH-6.2.1-1.cdh6.2.1.p0.1425774/lib/spark/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 217, in _get_helper
TypeError: 'JavaPackage' object is not callable
4. Solution, exactly as the error message suggests (the TypeError appears because the Kafka helper class is not on the JVM classpath, so py4j hands back an unresolved JavaPackage instead of a callable class):
2. Download the JAR of the artifact from Maven Central http://search.maven.org/,
Group Id = org.apache.spark, Artifact Id = spark-streaming-kafka-0-8-assembly, Version = 2.4.0-cdh6.2.1.
Then, include the jar in the spark-submit command as
$ bin/spark-submit --jars <spark-streaming-kafka-0-8-assembly.jar> ...
The message tells you to go to http://search.maven.org/ and download the spark-streaming-kafka-0-8-assembly jar for the matching version, 2.4.0.
Jar download link:
https://download.csdn.net/download/qq_22905163/20068310?spm=1001.2014.3001.5501
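If you would rather pull the assembly straight from Maven Central than from the mirror above, the Scala 2.11 / 2.4.0 build should sit at the standard repository path (assumption: the upstream 2.4.0 build is compatible with CDH's Spark 2.4.0):
https://repo1.maven.org/maven2/org/apache/spark/spark-streaming-kafka-0-8-assembly_2.11/2.4.0/spark-streaming-kafka-0-8-assembly_2.11-2.4.0.jar
Option 1 from the error message can also work when the node has internet access, but the coordinates it prints are missing the Scala suffix and the -cdh6.2.1 version is not published on Maven Central; something closer to this should resolve:
[root@hadoop102 ~]# spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.4.0 test2.py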
5. The jar I downloaded is spark-streaming-kafka-0-8-assembly_2.11-2.4.0.jar
My environment is CDH 6.2.1. I placed spark-streaming-kafka-0-8-assembly_2.11-2.4.0.jar under /root/ and point spark-submit at it explicitly with --jars.
Note: I had previously dropped this jar into Spark's jars/ directory. The test2.py job below ran fine that way, but launching the pyspark shell then failed outright: the SparkSession could not be initialized and it threw an error. That is why the jar now lives in /root/ and is passed with --jars instead.
6. Successful run
[root@hadoop102 ~]# spark-submit --jars spark-streaming-kafka-0-8-assembly_2.11-2.4.0.jar test2.py
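With the assembly on the classpath, typing a few words into the producer console from step 1 prints a word-count batch every 5 seconds, roughly of this shape (illustrative pprint output, not captured from a real run):
-------------------------------------------
Time: 2021-07-07 20:30:05
-------------------------------------------
('hello', 2)
('world', 1)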