Flume uses Twitter as the source, Kafka as the channel, and HDFS as the sink; Spark Streaming then reads the Kafka topic.
The Flume configuration file: kafka_twitter.conf
# Naming the components on the current agent.
TwitterAgent.sources = Twitter
TwitterAgent.channels = kafka-channel
TwitterAgent.sinks = sink1
# Describing/Configuring the source
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.consumerKey = 7PPYKH38pXjxdTCMR2gW7idoZ
TwitterAgent.sources.Twitter.consumerSecret = JHaymz2hrb0E95AZBERRYDFPCLhewVdzCkVT1Ws1ZORh3uuOpJ
TwitterAgent.sources.Twitter.accessToken = 2853850382-G876Yy7oSiwFDL3KFiewSuZiIHqUS7BXQ5WOg2v
TwitterAgent.sources.Twitter.accessTokenSecret = Y1tb155NjjJUaM8TNgA9E71GFseYGfZ8VyVEOjDJJ0CsP
TwitterAgent.sources.Twitter.keywords = Trump
TwitterAgent.sources.Twitter.channels = kafka-channel
# Describing/Configuring the sink
TwitterAgent.sinks.sink1.type = hdfs
TwitterAgent.sinks.sink1.hdfs.path = hdfs://serveur-hadoop.hadoop.com:8020/user/ychen/kafka/%{topic}/%y-%m-%d
TwitterAgent.sinks.sink1.hdfs.rollInterval = 5
TwitterAgent.sinks.sink1.hdfs.rollSize = 0
TwitterAgent.sinks.sink1.hdfs.rollCount = 0
TwitterAgent.sinks.sink1.hdfs.fileType = DataStream
TwitterAgent.sinks.sink1.channel = kafka-channel
# Describing/Configuring the channel
TwitterAgent.channels.kafka-channel.type = org.apache.flume.channel.kafka.KafkaChannel
TwitterAgent.channels.kafka-channel.capacity = 10000
TwitterAgent.channels.kafka-channel.transactionCapacity = 100
TwitterAgent.channels.kafka-channel.brokerList = serveur-hadoop.hadoop.com:9092
TwitterAgent.channels.kafka-channel.topic = twitter
TwitterAgent.channels.kafka-channel.zookeeperConnect = 147.135.135.51:2181
TwitterAgent.channels.kafka-channel.parseAsFlumeEvent = true
The Spark Streaming script: kafka_counting.py
import sys
import json

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: kafka_count.py <zk> <topic>")
        sys.exit(-1)

    sc = SparkContext(appName="PythonStreamingKafkaWordCount")
    sc.setLogLevel('WARN')
    ssc = StreamingContext(sc, 5)  # 5-second batches

    # get the ZooKeeper quorum and the topic name from the command line
    zkQuorum, topic = sys.argv[1:]
    kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 1})

    # each record is a (key, value) pair; the tweet JSON is the value
    parsed = kvs.map(lambda x: json.loads(x[1]))
    parsed.saveAsTextFiles("twitter_test_x")
    parsed.count().map(lambda x: 'Tweets in this batch:%s' % x).pprint()

    ssc.start()
    ssc.awaitTermination()
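Each piece runs fine on its own: Flume by itself works normally, and Spark Streaming can read a topic that I feed by hand from a producer. But when I point Spark Streaming at the twitter topic, I always get unicodedecodeerror: 'utf8' can not decode byte….
My first reaction was that the tweets contained some odd characters, but the Twitter files looked normal, and since pyspark reads the topic directly through KafkaUtils.createStream, I could not see anywhere in pyspark to set an encoding. Flume showed nothing abnormal either. I struggled for a whole day without finding anything to change.
Later I read the Flume Kafka documentation carefully and studied the definition of the Kafka channel. One item deserves attention.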
|parseAsFlumeEvent | true | : Set to true if a Flume source is writing to the channel and expects AvroDataums with the FlumeEvent schema (org.apache.flume.source.avro.AvroFlumeEvent) in the channel. Set to false if other producers are writing to the topic that the channel is using.
On a whim I changed this setting to false, and to my surprise the pyspark program could then read the twitter topic in Kafka. But can anyone tell me why?
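A likely explanation, sketched rather than proven: with parseAsFlumeEvent = true the channel stores each event as an Avro-serialized AvroFlumeEvent (a headers map followed by the body bytes), while KafkaUtils.createStream decodes message values as UTF-8 by default. The Avro length prefixes are binary, so the decode can fail even though the tweet itself is clean. The helpers below (avro_long, avro_flume_event, and a local utf8_decoder) are hypothetical and hand-rolled for illustration; they assume an event with empty headers and are not Flume's or Spark's actual code.

```python
import json

def avro_long(n):
    """Encode a non-negative int as an Avro long (zigzag + base-128 varint)."""
    z = n << 1  # zigzag of a non-negative value is simply 2 * n
    out = bytearray()
    while True:
        low7 = z & 0x7F
        z >>= 7
        if z:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)
            return bytes(out)

def avro_flume_event(body):
    """Hypothetical helper: an event with an empty headers map, then the body."""
    return avro_long(0) + avro_long(len(body)) + body

def utf8_decoder(raw):
    """Stand-in for the default value decoder: plain UTF-8 decoding."""
    return raw.decode("utf-8")

tweet = json.dumps({"text": "x" * 200}).encode("utf-8")

# Raw body, as written when parseAsFlumeEvent = false: decodes and parses.
plain_ok = json.loads(utf8_decoder(tweet))["text"][:3]

# Avro-wrapped body, as written when parseAsFlumeEvent = true.
try:
    utf8_decoder(avro_flume_event(tweet))
    wrapped_ok = True
except UnicodeDecodeError:
    wrapped_ok = False
```

In this sketch the 212-byte body gets the length prefix 0xA8 0x03, and 0xA8 is not a valid UTF-8 start byte, so the decode fails. createStream also accepts a valueDecoder argument, so the decoding step can be replaced on the pyspark side if the Avro wrapping has to stay.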
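With the channel fixed, the sink broke next, complaining that there was no timestamp. The hdfs.path above uses the %y-%m-%d escapes, which need a timestamp for every event. This is easy to solve: add hdfs.useLocalTimeStamp = true to the HDFS sink in the Flume configuration file (for this agent, TwitterAgent.sinks.sink1.hdfs.useLocalTimeStamp = true).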
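To make the timestamp problem concrete, here is a small model of how a date escape in the sink path gets resolved. It is only a sketch: resolve_date_path is a hypothetical function, not Flume code, and it uses UTC for reproducibility where the real sink would use its configured time zone. The assumption is that once parseAsFlumeEvent is false the events come back from Kafka without Flume headers, so the timestamp header is gone unless the sink falls back to the local clock.

```python
import time

def resolve_date_path(template, headers, use_local_timestamp=False):
    """Fill strftime-style escapes such as %y-%m-%d in a path template.

    headers models the Flume event headers; "timestamp" holds epoch
    milliseconds as a string.
    """
    if use_local_timestamp:
        millis = int(time.time() * 1000)      # the sink's own clock
    elif "timestamp" in headers:
        millis = int(headers["timestamp"])    # the event's header
    else:
        raise ValueError("no timestamp header on the event")
    return time.strftime(template, time.gmtime(millis / 1000.0))

# An event that still carries its timestamp header resolves normally.
with_header = resolve_date_path("kafka/%y-%m-%d", {"timestamp": "1500000000000"})

# An event without headers only resolves when the local clock is used.
without_header = resolve_date_path("kafka/%y-%m-%d", {}, use_local_timestamp=True)
```

Calling resolve_date_path("kafka/%y-%m-%d", {}) without the fallback raises, which is the situation the sink was reporting.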
One more question. When reading the tweets I used parsed = kvs.map(lambda x: json.loads(x[1])). I don't understand why it has to be x[1] here; writing just json.loads(x) does not work.
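The reason, as far as I can tell, is that each record in the DStream returned by KafkaUtils.createStream is a (key, value) pair: x[0] is the Kafka message key and x[1] is the message value, which here holds the tweet JSON. The record below is made up for illustration and runs without Spark.

```python
import json

# Hypothetical record, shaped like the (key, value) pairs createStream emits.
record = (None, '{"text": "hello", "lang": "en"}')

tweet = json.loads(record[1])   # parse only the message value

try:
    json.loads(record)          # the whole tuple is not a JSON string
    tuple_parsed = True
except TypeError:
    tuple_parsed = False
```

So json.loads(x) fails because it receives the tuple, not the string inside it.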
PS: a tutorial on writing the pyspark side can be found here.
getting-started-with-spark-streaming-with-python-and-kafka/