I. Project Background
1. Real-time data is written to a Kafka topic, then collected in batches by Flume and landed on HDFS. The data is in standard JSON format (no nested JSON).
2. The test environment simulates the data-collection process. The test cluster is a big-data platform independently developed and deployed by a third-party company on top of today's mainstream open-source components, covering the usual stack: HDFS, MapReduce, YARN, Hive, HBase, Phoenix, ZooKeeper, Spark, Impala, Flume, Sqoop, Kafka, Solr, Oozie, Hue, Redis, etc.
3. Everything is done through the components themselves, with zero programming.
The following records this simulated collection run, including the steps taken and the problems encountered along the way.
II. Creating the Mock Data
Create JSON-formatted data and write it to the topic with Kafka's built-in console producer, ready for Flume to consume. (In the sample below, the status values 有效 and 无效 mean "valid" and "invalid".)
{"gatheringTime":"2016-03-24","status":"有效","speed":"0.0 km/h","distance":"6000 km"}
{"gatheringTime":"2017-04-24","status":"无效","speed":"0.0 km/h","distance":"7000 km"}
{"gatheringTime":"2018-04-24","status":"有效","speed":"0.0 km/h","distance":"8000 km"}
{"gatheringTime":"2019-04-24","status":"无效","speed":"0.0 km/h","distance":"9000 km"}
{"gatheringTime":"2020-04-24","status":"有效","speed":"0.0 km/h","distance":"10000 km"}
{"gatheringTime":"2020-05-25","status":"有效","speed":"0.0 km/h","distance":"12000 km"}
{"gatheringTime":"2021-05-25","status":"有效","speed":"0.0 km/h","distance":"12000 km"}
{"gatheringTime":"2022-05-25","status":"有效","speed"
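Although the pipeline itself requires no coding, the sample records above can be reproduced with a short script so they can be piped into the console producer. A minimal Python sketch (the record values are illustrative, taken from the sample above):

```python
import json

# Mock telemetry records matching the sample above.
# Fields: gatheringTime (collection date), status (有效 = valid,
# 无效 = invalid), speed, distance.
records = [
    {"gatheringTime": "2016-03-24", "status": "有效",
     "speed": "0.0 km/h", "distance": "6000 km"},
    {"gatheringTime": "2017-04-24", "status": "无效",
     "speed": "0.0 km/h", "distance": "7000 km"},
]

# One flat JSON object per line -- the format Flume will ship to HDFS.
lines = [json.dumps(r, ensure_ascii=False) for r in records]
with open("mock_data.json", "w", encoding="utf-8") as f:
    f.write("\n".join(lines) + "\n")
```

The resulting file can then be fed to Kafka's console producer, e.g. `kafka-console-producer.sh --broker-list <broker>:9092 --topic <topic> < mock_data.json`, where the broker address and topic name are placeholders for your cluster's actual values.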