I use Log4j + Flume + Kafka to collect real-time data, and Spark Streaming to read that data from Kafka and process it.
Strictly speaking, Spark Streaming is near-real-time computation: it reads from Kafka in batches, at a configured time interval. That interval can be set very small, though, so the result comes close to real time.
Let's walk through the implementation step by step.
- Add the Flume dependencies to the project.
<dependency>
    <groupId>org.apache.flume.flume-ng-clients</groupId>
    <artifactId>flume-ng-log4jappender</artifactId>
    <version>1.9.0</version>
    <exclusions>
        <exclusion>
            <artifactId>log4j</artifactId>
            <groupId>log4j</groupId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-flume-ng</artifactId>
    <version>2.11.2</version>
    <exclusions>
        <exclusion>
            <artifactId>log4j-core</artifactId>
            <groupId>org.apache.logging.log4j</groupId>
        </exclusion>
        <exclusion>
            <artifactId>log4j-api</artifactId>
            <groupId>org.apache.logging.log4j</groupId>
        </exclusion>
    </exclusions>
</dependency>
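The Spark Streaming job in the last step also needs the Spark Streaming core and Kafka integration artifacts. A minimal sketch of those extra dependencies (the Scala suffix 2.11 and version 2.4.0 are assumptions; match them to your cluster):

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.4.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>2.4.0</version>
</dependency>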
- Configure log4j
log4j.logger.logappendername = DEBUG,logappendername (custom logger name you define)
log4j.appender.logappendername = org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.logappendername.Hostname = 172.16.1.1 (IP of the Flume server)
log4j.appender.logappendername.Port = 44444 (port of the corresponding source in Flume)
log4j.appender.logappendername.AvroReflectionEnabled = true
log4j.appender.logappendername.AvroSchemaUrl = hdfs://namenode/path/to/schema.avsc
log4j.appender.logappendername.layout = org.apache.log4j.PatternLayout
log4j.additivity.logappendername = false
During business operations, the data that needs to be captured is recorded through this logger:
private static Logger logger = Logger.getLogger("logappendername");

public void operation() {
    // Everything written to this logger is forwarded to Flume by the Log4jAppender
    logger.debug("xxxx");
}
Here logappendername must match the logger/appender name defined in the log4j configuration above.
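As a slightly fuller sketch of this step, a business method might write each event as one delimited line, so the Spark Streaming job can split it into fields later (the OrderService class, the field names and the comma separator are illustrative assumptions, not part of the original setup):

import org.apache.log4j.Logger;

public class OrderService {
    // The logger name must match the one configured in the log4j properties above
    private static final Logger logger = Logger.getLogger("logappendername");

    public void placeOrder(String orderId, String userId, long amount) {
        // One event per line, comma-separated; the appender ships the line to Flume
        logger.debug(orderId + "," + userId + "," + amount);
    }
}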
- Configure Flume
flumedata.sources = avro-source
flumedata.sinks = kafka-sink
flumedata.channels = memory-channel
flumedata.sources.avro-source.type = avro
flumedata.sources.avro-source.bind = 172.16.1.1 (Flume server IP)
flumedata.sources.avro-source.port = 44444 (Flume port, matching the log4j configuration)
flumedata.channels.memory-channel.type = memory
flumedata.sinks.kafka-sink.type = org.apache.flume.sink.kafka.KafkaSink
flumedata.sinks.kafka-sink.brokerList = 172.16.1.2:9092 (Kafka broker address and port)
flumedata.sinks.kafka-sink.topic = kafkatopicname (Kafka topic name, used later in the Spark Streaming program)
flumedata.sinks.kafka-sink.batchSize = 1
flumedata.sinks.kafka-sink.requiredAcks = 1
flumedata.sources.avro-source.channels = memory-channel
flumedata.sinks.kafka-sink.channel = memory-channel
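With the properties saved to a file (assumed here to be conf/flumedata.conf), the agent can be started with the standard flume-ng command; --name must match the agent name flumedata used in the configuration above:

bin/flume-ng agent --conf conf --conf-file conf/flumedata.conf --name flumedata -Dflume.root.logger=INFO,console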
- Develop the Spark Streaming job
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class LogStreamingApp {

    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setMaster("local[*]")
                .setAppName("TestApp");
        // Batch interval of 10 seconds: data is pulled from Kafka every 10 seconds
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
        Map<String, Object> kafkaParams = new HashMap<>();
        // Kafka broker address (host:port), passed in as a program argument
        kafkaParams.put("bootstrap.servers", args[1]);
        // Deserializer for the record key (String, UTF-8 by default)
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        // Deserializer for the record value (String, UTF-8 by default)
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        try {
            Collection<String> topics = Arrays.asList("kafkatopicname");
            // The consumer group id can be chosen freely, but must not clash with other consumers
            kafkaParams.put("group.id", "group1");
            JavaInputDStream<ConsumerRecord<String, String>> logs = KafkaUtils.createDirectStream(
                    jssc,
                    LocationStrategies.PreferConsistent(),
                    ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams)
            );
            // flatMap turns the batch of Kafka records into a stream of individual lines
            JavaDStream<String> lines =
                    logs.flatMap(record -> {
                        String lineValue = record.value();
                        return Arrays.asList(lineValue).iterator();
                    });
            // Persist the data line by line; no complex computation here, this is just a simple example.
            // Spark operators can be applied at this point to transform or aggregate the data before persisting it.
            lines.foreachRDD(rdd -> {
                rdd.foreachPartition(partitionRecords -> {
                    List<String> logList = new ArrayList<String>();
                    partitionRecords.forEachRemaining(line -> {
                        logList.add(line);
                    });
                    // Call a DAO method here to persist the results (see the sketch after this code)
                });
            });
            jssc.start();
            jssc.awaitTermination();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
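The DAO call above is left abstract. As one possible sketch of that persistence step, the collected lines could be written with plain JDBC; the LogDao class, the MySQL URL, credentials and the logdata table below are assumptions for illustration only:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class LogDao {
    // Hypothetical JDBC sink: connection URL, credentials and table layout are assumptions
    public static void saveBatch(List<String> logList) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://172.16.1.3:3306/logdb", "user", "password");
             PreparedStatement ps = conn.prepareStatement("INSERT INTO logdata (line) VALUES (?)")) {
            for (String line : logList) {
                ps.setString(1, line);  // one row per log line
                ps.addBatch();
            }
            ps.executeBatch();          // write the whole partition in one batch
        }
    }
}

Inside foreachPartition this would be invoked as LogDao.saveBatch(logList); opening one connection per partition (rather than per record) keeps the overhead manageable.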
With that, a simple demo of processing log data in near real time with Spark Streaming is complete.