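Software versions used
Spark 2.3.0
IntelliJ IDEA 2019.1
kafka_2.11-0.10.2.2
spark-streaming-kafka-0-10_2.11-2.3.0
The plan: a crawler running on Windows 7 saves JSON data into a Windows 7 folder. That folder is shared, CentOS 7 mounts the share to get at the JSON data, and Kafka's bundled connect-file-source connector then watches the file: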
bin/connect-standalone.sh config/connect-standalone.properties config/connect-file-source.properties
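For reference, here is a sketch of what config/connect-file-source.properties might contain for this setup. The property keys are the ones in the file shipped with Kafka. The file path is a hypothetical mount point (an assumption, not the actual path), and the topic name is the one the console consumer in this post reads from.

```properties
# Sketch only: the path below is a hypothetical mount point, not the real one.
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
# File on the mounted Windows share that the connector tails (assumed path)
file=/mnt/win7_share/data.json
# Topic that each line of the file is published to
topic=streaming_kafka
```

The connector publishes every line appended to the file as one Kafka record, so the crawler should write one JSON object per line.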
Use the following command to view the data that has passed through Kafka:
./bin/kafka-console-consumer.sh --bootstrap-server master:9092,slave1:9092,slave2:9092 --topic streaming_kafka --from-beginning
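Assuming the Connect worker uses the JSON converter with schemas enabled (the setting in the stock connect-standalone.properties), each line printed by this consumer is an envelope with a `schema` field and a `payload` field, and `payload` holds the original file line as a JSON-encoded string. It therefore has to be parsed twice. A minimal Python sketch of the unpacking follows; the function name `unwrap_connect_record` and the sample record are made up for illustration, with field names borrowed from the crawler output.

```python
import json


def unwrap_connect_record(line):
    """Return the original JSON object carried inside a Connect envelope."""
    envelope = json.loads(line)      # outer layer: {"schema": ..., "payload": "..."}
    payload = envelope["payload"]    # still a str: the raw line read from the file
    return json.loads(payload)       # inner layer: the crawler's JSON object


# Illustrative record only; real records carry more fields.
inner = json.dumps({"like_count": 832, "view_count": 37210, "comment_count": 178})
sample = json.dumps({"schema": {"type": "string", "optional": False}, "payload": inner})

record = unwrap_connect_record(sample)
print(record["like_count"], record["view_count"])  # prints: 832 37210
```

The same two-step parse is needed in whatever consumes the topic downstream, unless the worker is reconfigured to emit the raw line without the envelope.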
The data has the following format:
{"schema":{"type":"string","optional":false},"payload":"{\"like_count\": 832, \"view_count\": 37210, \"user_name\": \" ֪ʶ \", \"play_url\": \"http://jsmov2.a.yximgs.com/upic/2019/04/12/19/A0MNc3NjIxXzJfMw==_b_B12594561fec10c99ab12c417bfbc8b7d.mp4?tag=1-1555243582-h-0-mznoh8fetl-6e60d4850f55979f\", \"description\": \" ٻ С֪ʶ \\n# л Ҫ \", \"cover\": \"http://ali2.a.yximgs.com/uhead/AB/2019/02/18/01/BjYxXzJfaGQ1NTZfNzg3_s.jpg\", \"video_id\": 5229242128224334952, \"comment_count\": 178, \"download_url\": \"http://txmov2-fallback.a.yximgs.com/upic/2019/04/12/19/BNDQxMjYxXzEyMTQ0ODc3NjIxXzJfMw==_