My versions:
- Linux: 16.04
- Kafka: 0.10
- Scala: 2.11
- Spark: 2.4.7
- JAR: spark-streaming-kafka-0-8_2.11-2.4.7.jar
When I try to have Spark receive Kafka data via the direct approach, an error is thrown. The Python code is as follows (it only tests whether the environment works, so the logic itself can be ignored):
# Spark receives Kafka data via the direct (receiver-less) approach;
# the only purpose here is to test whether the environment works.
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

offsetRanges = []

def storeOffsetRanges(rdd):
    # Record the Kafka offset ranges of each batch so they can be printed later
    global offsetRanges
    offsetRanges = rdd.offsetRanges()
    return rdd

def printOffsetRanges(rdd):
    for o in offsetRanges:
        print("%s %s %s %s" % (o.topic, o.partition, o.fromOffset, o.untilOffset))

def start():
    spark_conf = SparkConf().setMaster("local").setAppName("KafkaDirect")
    sc = SparkContext(conf=spark_conf)
    sc.setLogLevel("WARN")
    ssc = StreamingContext(sc, 1)
    brokers = "localhost:9092"
    topic = 'ratingTopic'
    kafkaStreams = KafkaUtils.createDirectStream(ssc, [topic], kafkaParams={"metadata.broker.list": brokers})
    # Receiver-based alternative (note: createStream expects a ZooKeeper quorum,
    # e.g. localhost:2181, rather than the broker list):
    # kafkaStreams = KafkaUtils.createStream(ssc, brokers, 1, {topic: 1})
    # Count the distribution of the generated random numbers
    result = kafkaStreams.map(lambda x: (x[0], 1)).reduceByKey(lambda x, y: x + y)
    result.pprint()
    # Print the offsets of each batch; they could also be written to ZooKeeper.
    # You can use transform() instead of foreachRDD() as your
    # first method call in order to access offsets, then call further Spark methods.
    kafkaStreams.transform(storeOffsetRanges).foreachRDD(printOffsetRanges)
    ssc.start()
    ssc.awaitTermination()
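On the comment about writing offsets to ZooKeeper: below is a minimal sketch of what that could look like, assuming the kazoo client library is installed and ZooKeeper listens on localhost:2181. The saveOffsetsToZk name and the /consumers/... znode layout are made up for illustration; they are not part of the code above.

# Hypothetical sketch: persist each batch's offsets to ZooKeeper with kazoo.
# kazoo, the ZooKeeper address, and the znode layout are assumptions.
from kazoo.client import KazooClient

zk = KazooClient(hosts="localhost:2181")  # assumed ZooKeeper address
zk.start()

def saveOffsetsToZk(rdd):
    # Meant for the same transform() position as storeOffsetRanges above:
    # transform() runs on the driver, where offsetRanges() is available.
    for o in rdd.offsetRanges():
        path = "/consumers/KafkaDirect/offsets/%s/%d" % (o.topic, o.partition)
        zk.ensure_path(path)
        zk.set(path, str(o.untilOffset).encode("utf-8"))
    return rdd

# Usage: kafkaStreams.transform(saveOffsetsToZk)

Reusing one long-lived KazooClient avoids reconnecting on every micro-batch; a real job would also close it on shutdown. When submitting, the integration JAR listed above must be on the classpath, e.g. via spark-submit --jars spark-streaming-kafka-0-8_2.11-2.4.7.jar.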