Fixing the missing spark-streaming-kafka-0-8-assembly.jar error when submitting a PySpark Kafka job

1. Start the Kafka console producer

[root@hadoop102 ~]# kafka-console-producer --broker-list hadoop102:9092 --topic test1
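Once the producer is up, type a few space-separated lines at its > prompt; these become the input for the word count in the next step (any sample text works):

>hello spark hello kafka
>hello streaming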
 

2. PySpark receiver script: test2.py

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext('local[*]', 'myapp')
ssc = StreamingContext(sc, 5)  # 5-second micro-batches

zkQuorum = 'hadoop102:2181'   # ZooKeeper quorum used by the receiver
groupId = 'kafka1'            # consumer group id
topics = {'test1': 1}         # topic -> number of receiver threads
# The 0.8 ZooKeeper-based consumer expects 'smallest'/'largest', not 'earliest'/'latest'
kafkaParams = {'auto.offset.reset': 'smallest'}
dstream = KafkaUtils.createStream(ssc, zkQuorum, groupId, topics, kafkaParams=kafkaParams).map(lambda t: t[1])

# Word count over each batch
result = dstream.flatMap(lambda line: line.split(' ')) \
                .map(lambda word: (word, 1)) \
                .reduceByKey(lambda a, b: a + b)

# Print each batch to stdout
result.pprint()

# Start the job and block until it is stopped
ssc.start()
ssc.awaitTermination()
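For comparison, the same 0.8 integration also offers a direct (receiver-less) API. The sketch below is not part of the original script; it assumes the broker listens on hadoop102:9092 and it needs the same assembly jar on the classpath:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext('local[*]', 'myapp-direct')
ssc = StreamingContext(sc, 5)

# Direct stream: offsets are tracked by Spark and read straight from the brokers,
# so it takes a broker list instead of a ZooKeeper quorum
kafkaParams = {'metadata.broker.list': 'hadoop102:9092'}
dstream = KafkaUtils.createDirectStream(ssc, ['test1'], kafkaParams).map(lambda t: t[1])

dstream.flatMap(lambda line: line.split(' ')) \
       .map(lambda word: (word, 1)) \
       .reduceByKey(lambda a, b: a + b) \
       .pprint()

ssc.start()
ssc.awaitTermination()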
 

3. The error:

[root@hadoop102 ~]# spark-submit test2.py 
21/07/07 20:18:21 WARN lineage.LineageWriter: Lineage directory /var/log/spark/lineage doesn't exist or is not writable. Lineage for this application will be disabled.

________________________________________________________________________________________________

  Spark Streaming's Kafka libraries not found in class path. Try one of the following.

  1. Include the Kafka library and its dependencies with in the
     spark-submit command as

     $ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.4.0-cdh6.2.1 ...

  2. Download the JAR of the artifact from Maven Central http://search.maven.org/,
     Group Id = org.apache.spark, Artifact Id = spark-streaming-kafka-0-8-assembly, Version = 2.4.0-cdh6.2.1.
     Then, include the jar in the spark-submit command as

     $ bin/spark-submit --jars <spark-streaming-kafka-0-8-assembly.jar> ...

________________________________________________________________________________________________


Traceback (most recent call last):
  File "/root/test2.py", line 13, in <module>
    dstream=KafkaUtils.createStream(ssc,zkQuorum,groupId,topics).map(lambda t:t[1])
  File "/opt/cloudera/parcels/CDH-6.2.1-1.cdh6.2.1.p0.1425774/lib/spark/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 78, in createStream
  File "/opt/cloudera/parcels/CDH-6.2.1-1.cdh6.2.1.p0.1425774/lib/spark/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 217, in _get_helper
TypeError: 'JavaPackage' object is not callable

4. Solution, as the hint above suggests:

2. Download the JAR of the artifact from Maven Central http://search.maven.org/,
     Group Id = org.apache.spark, Artifact Id = spark-streaming-kafka-0-8-assembly, Version = 2.4.0-cdh6.2.1.
     Then, include the jar in the spark-submit command as

     $ bin/spark-submit --jars <spark-streaming-kafka-0-8-assembly.jar> ...

The error message says to go to http://search.maven.org/ and download the spark-streaming-kafka-0-8-assembly jar for the matching version, 2.4.0. (The TypeError: 'JavaPackage' object is not callable at the end of the traceback is just the Python-side symptom of the same problem: the Kafka helper classes are missing from the JVM classpath, so the py4j reference resolves to an empty JavaPackage instead of a callable class.)

Jar download link:

https://download.csdn.net/download/qq_22905163/20068310?spm=1001.2014.3001.5501
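
Alternatively, the Scala 2.11 / Spark 2.4.0 assembly can be pulled straight from Maven Central; assuming the standard repository layout, something like:

wget https://repo1.maven.org/maven2/org/apache/spark/spark-streaming-kafka-0-8-assembly_2.11/2.4.0/spark-streaming-kafka-0-8-assembly_2.11-2.4.0.jar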

5. I downloaded spark-streaming-kafka-0-8-assembly_2.11-2.4.0.jar

My environment is CDH 6.2.1. I placed spark-streaming-kafka-0-8-assembly_2.11-2.4.0.jar under /root/ and pointed spark-submit at it explicitly with --jars.

Note: I had previously dropped this jar into Spark's jars/ directory. The test2.py program then ran fine, but launching the pyspark shell failed outright because the SparkSession could not be initialized. So keep the jar out of jars/ and pass it with --jars instead, as in step 6 below.
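If the node has internet access, another option is the --packages flag from hint 1 in the error message. The coordinates below are the stock Apache 2.4.0 build rather than the CDH-suffixed version from the hint, so treat this as an untested variant on this cluster:

spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.4.0 test2.py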

6. Successful run

[root@hadoop102 ~]# spark-submit --jars spark-streaming-kafka-0-8-assembly_2.11-2.4.0.jar test2.py
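
With the producer from step 1 sending lines such as "hello spark hello", each 5-second batch should print a word count roughly like the following (illustrative output; timestamps and counts depend on what you type):

-------------------------------------------
Time: 2021-07-07 20:20:05
-------------------------------------------
('hello', 2)
('spark', 1)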
