1. Setting up the environment
Spark runs from a Docker image: docker pull bde2020/spark-master (see that image's description for details). Kafka was installed following online tutorials; an earlier document already covers this, so it is not repeated here.
2. Sample code
#encoding=utf8
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
import json

offsets = []

def store_offset(rdd):
    # Remember the Kafka offset ranges of the current batch.
    global offsets
    offsets = rdd.offsetRanges()
    return rdd

def print_offset(rdd):
    for o in offsets:
        print("%s %s %s %s %s" % (o.topic, o.partition, o.fromOffset,
                                  o.untilOffset, o.untilOffset - o.fromOffset))

scontext = SparkContext(appName='kafka_pyspark_test')
stream_context = StreamingContext(scontext, 2)
msg_stream = KafkaUtils.createDirectStream(
    stream_context,
    ['asin_bsr_result'],
    kafkaParams={"metadata.broker.list": "3.120.9.44:9092"})
# The direct stream yields (key, value) tuples; the JSON payload is the value.
result = msg_stream.map(lambda x: json.loads(x[1]).keys())
msg_stream.transform(store_offset).foreachRDD(print_offset)
result.pprint()
stream_context.start()
stream_context.awaitTermination()
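As a side note, the map step above assumes every message value is a valid JSON object; that parsing logic can be sanity-checked locally without a cluster. The sample messages below are invented, in the same (key, value) tuple shape the direct stream delivers:

```python
import json

def extract_keys(message):
    """Mimic the stream's map step: take a (key, value) tuple and
    return the sorted keys of the JSON object in the value."""
    _, value = message
    return sorted(json.loads(value).keys())

# Hypothetical sample messages, shaped like createDirectStream output.
batch = [
    (None, '{"asin": "B000123", "rank": 17}'),
    (None, '{"asin": "B000456", "rank": 3}'),
]
print([extract_keys(m) for m in batch])  # → [['asin', 'rank'], ['asin', 'rank']]
```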
3. Submitting and running
Switch to the spark/bin directory and run ./spark-submit --master spark://0.0.0.0:7077 test.py. This fails with a missing-jar error for spark-streaming-kafka-0-8-assembly, version 2.3.1; download the matching package from http://search.maven.org/ into the spark/jars directory and the job runs normally. Note that the master address in the submit command must be 0.0.0.0; using localhost or 127.0.0.1 raises a server-not-found error.
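Instead of browsing the search page, the jar can also be fetched directly from Maven Central; the URL below just follows Central's standard group/artifact/version layout:

```shell
# Maven Central lays artifacts out as group/artifact/version/artifact-version.jar.
GROUP=org/apache/spark
ARTIFACT=spark-streaming-kafka-0-8-assembly_2.11
VERSION=2.3.1
URL="https://repo1.maven.org/maven2/${GROUP}/${ARTIFACT}/${VERSION}/${ARTIFACT}-${VERSION}.jar"
echo "$URL"
# wget -P ../jars "$URL"   # uncomment to download into the jars directory
```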
Running it then throws java.lang.ClassNotFoundException: org.apache.spark.streaming.kafka.KafkaRDDPartition, so the jar has to be specified by hand. Download spark-streaming-kafka-0-8_2.11-2.3.1.jar into the jars directory and run
./spark-submit --packages --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.1 \
--master spark://0.0.0.0:7077 \
test.py
This fails with a main-class error: Cannot load main class from JAR org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.1 with URI org.apache.spark. Please specify a class through --class. (The --packages flag is accidentally duplicated here, so spark-submit consumes the second --packages as the flag's value and treats the Maven coordinate as the application jar.)
Switching to
./spark-submit --class sparkstreaming.KafkaStreaming \
--jars ../jars/spark-streaming-kafka-0-8_2.11-2.3.1.jar \
--master spark://0.0.0.0:7077 \
test.py
fails with java.lang.NoClassDefFoundError: kafka/common/TopicAndPartition, because the plain (non-assembly) jar does not bundle the Kafka client classes.
Finally,
./spark-submit --class sparkstreaming.KafkaStreaming \
--jars ../jars/spark-streaming-kafka-0-8_2.11-2.3.1.jar \
--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.1 \
--master spark://0.0.0.0:7077 \
test.py
runs successfully.
./spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.1 --jars ../jars/spark-streaming-kafka-0-8-assembly_2.11-2.3.2.jar --class sparkstreaming.KafkaStreaming --master spark://0.0.0.0:7077 kafka_test.py  # After rebuilding the container later, only this command worked: the assembly variant of the jar (spark-streaming-kafka-0-8-assembly_2.11-2.3.2.jar), which bundles its dependencies, is required. A painful lesson!
#encoding=utf8
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
import redis

def toredis(rdd):
    def write_partition(records):
        # Open the connection inside the partition function: a client created
        # on the driver cannot be shipped to the executors.
        client = redis.Redis(host="172.17.0.7", port=6379)
        for key, value in records:
            client.set(key, value)
    rdd.foreachPartition(write_partition)

def out_data(y):
    print(y)

sparkcontext = SparkContext("spark://0.0.0.0:7077", "spark_test01")
# Ship the zipped redis module so the executors can import it.
sparkcontext.addPyFile('redis.zip')
ssc = StreamingContext(sparkcontext, 5)
# Receiver-based stream: ZooKeeper address, consumer group, {topic: receiver threads}.
kafka_strem_context = KafkaUtils.createStream(ssc, "172.17.0.6:2181", "spark_test01", {"flumetest2": 1})
# Messages arrive as (key, value) tuples; fields in the value are separated by "|@|".
result = kafka_strem_context.map(lambda x: x[1]) \
    .map(lambda x: (x.split("|@|")[0], x.split("|@|")[2])) \
    .reduceByKey(lambda a, b: a + b)
result.foreachRDD(out_data)
result.foreachRDD(toredis)
ssc.start()
ssc.awaitTermination()
Writing the results out with Redis fails at runtime with an error that redis.client cannot be found. Package the redis module from site-packages into a zip, upload it, and declare the dependency in the code (addPyFile). The job is then submitted with ./spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.1 --jars ../jars/spark-streaming-kafka-0-8-assembly_2.11-2.3.2.jar --class sparkstreaming.KafkaStreaming --master spark://0.0.0.0:7077 --py-files redis.zip kafka_test.py. Note that --py-files must come before the script name: anything after kafka_test.py is passed to the application as an argument rather than parsed as a spark-submit option.
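The split-and-concatenate pipeline in the script can be exercised with plain Python before submitting. The records below are made up, and reduce_by_key is only a local stand-in for Spark's reduceByKey:

```python
from collections import defaultdict

def split_record(value):
    # Same projection as the stream: field 0 is the key, field 2 the value.
    fields = value.split("|@|")
    return fields[0], fields[2]

def reduce_by_key(pairs):
    # Local stand-in for reduceByKey(lambda a, b: a + b) on string values.
    acc = defaultdict(str)
    for k, v in pairs:
        acc[k] += v
    return dict(acc)

records = ["u1|@|x|@|a", "u2|@|x|@|b", "u1|@|y|@|c"]
print(reduce_by_key(split_record(r) for r in records))  # → {'u1': 'ac', 'u2': 'b'}
```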