In the previous post we saw how Druid can consume a Kafka topic for real-time ingestion. In this one we'll use Flume to ship local log files into Kafka in real time.
Flume is a log-collection system: it pulls data in through a variety of sources, applies some preprocessing, and delivers it to a downstream pipeline such as HDFS or Kafka. Druid works best with flat, simple-typed fields, but our logs currently pack the interesting fields into a single JSON-string field named ext_data. We therefore need a custom Flume interceptor that parses ext_data and promotes its fields to the top level, as in the code shown below.
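To make the transformation concrete, here is what a log line should look like before and after the interceptor (slot_id and cost are illustrative field names, not from the original logs):

Before:
{"uid": "u1", "ext_data": "{\"slot_id\": 3, \"cost\": 12}"}

After:
{"uid": "u1", "copy_uid": "u1", "slot_id": 3, "cost": 12}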
package com.test.intercepter;

import com.google.common.collect.Lists;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;
import org.codehaus.jackson.JsonNode;
import org.codehaus.jackson.map.ObjectMapper;
import org.codehaus.jackson.node.ObjectNode;

import java.io.ByteArrayInputStream;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class ExtraIntercepter implements Interceptor {

    // ObjectMapper is thread-safe and expensive to construct,
    // so share a single instance instead of creating one per event
    private static final ObjectMapper MAPPER = new ObjectMapper();

    private ExtraIntercepter() {
    }

    @Override
    public void initialize() {
    }

    @Override
    public void close() {
    }

    @Override
    public Event intercept(Event event) {
        try {
            JsonNode node = MAPPER.readTree(new ByteArrayInputStream(event.getBody()));
            ObjectNode objectNode = (ObjectNode) node;
            boolean modified = false;

            // Keep a copy of the original uid field, in case ext_data
            // contains a field of the same name and overwrites it
            if (objectNode.has("uid")) {
                objectNode.put("copy_uid", objectNode.get("uid"));
                modified = true;
            }

            // ext_data is a JSON object serialized into a string field;
            // parse it and promote its fields to the top level.
            // Guard against events that lack ext_data so they aren't dropped.
            JsonNode extNode = objectNode.get("ext_data");
            if (extNode != null && extNode.isTextual()) {
                JsonNode extData = MAPPER.readTree(extNode.getTextValue());
                if (extData.isObject()) {
                    Iterator<Map.Entry<String, JsonNode>> fields = extData.getFields();
                    while (fields.hasNext()) {
                        Map.Entry<String, JsonNode> field = fields.next();
                        objectNode.put(field.getKey(), field.getValue());
                    }
                    objectNode.remove("ext_data");
                    modified = true;
                }
            }

            if (modified) {
                event.setBody(MAPPER.writeValueAsString(objectNode).getBytes());
            }
        } catch (Exception e) {
            // Malformed JSON: log the error and drop the event
            // (returning null discards it from the batch)
            e.printStackTrace();
            return null;
        }
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        List<Event> out = Lists.newArrayList();
        for (Event event : events) {
            Event outEvent = intercept(event);
            if (outEvent != null) {
                out.add(outEvent);
            }
        }
        return out;
    }

    public static class ExtraBuilder implements Interceptor.Builder {

        @Override
        public void configure(Context context) {
        }

        @Override
        public Interceptor build() {
            return new ExtraIntercepter();
        }
    }
}
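Before deploying, the interceptor can be sanity-checked outside Flume with a throwaway harness like the one below (a hypothetical test class placed in the same package; it is not part of the deployed jar):

package com.test.intercepter;

import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.interceptor.Interceptor;

public class InterceptorSmokeTest {
    public static void main(String[] args) {
        Interceptor interceptor = new ExtraIntercepter.ExtraBuilder().build();
        interceptor.initialize();
        // sample log line with a nested ext_data JSON string
        String line = "{\"uid\":\"u1\",\"ext_data\":\"{\\\"slot_id\\\":3,\\\"cost\\\":12}\"}";
        Event in = EventBuilder.withBody(line.getBytes());
        Event out = interceptor.intercept(in);
        System.out.println(new String(out.getBody()));
        // expected: {"uid":"u1","copy_uid":"u1","slot_id":3,"cost":12}
    }
}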
Load the project's dependencies with Maven in IntelliJ IDEA:
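The original post doesn't list the dependencies, but based on the imports above, a pom.xml along these lines should work (the versions are assumptions; match them to the jars in your Flume lib directory, and mark them provided since Flume already ships them):

<dependencies>
    <dependency>
        <groupId>org.apache.flume</groupId>
        <artifactId>flume-ng-core</artifactId>
        <version>1.9.0</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.codehaus.jackson</groupId>
        <artifactId>jackson-mapper-asl</artifactId>
        <version>1.9.13</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>com.google.guava</groupId>
        <artifactId>guava</artifactId>
        <version>11.0.2</version>
        <scope>provided</scope>
    </dependency>
</dependencies>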
Then use Build -> Build Artifacts to package the project, and copy the resulting jar into Flume's lib directory.
Next, create a Flume agent configuration file, flume-ad-test.conf, as below. It tails every file in a directory, runs each line through the interceptor above, and writes the result to a downstream Kafka topic:
a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = TAILDIR
# Where TAILDIR stores its read offsets (survives agent restarts)
a1.sources.r1.positionFile = /usr/local/flume-1.9.0/data/ad_test_taildir_position.json
# Files to tail
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /var/log/ad-log/.*log
a1.sources.r1.fileHeader = true

a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.test.intercepter.ExtraIntercepter$ExtraBuilder

# Describe the sink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = ad-test
a1.sinks.k1.kafka.bootstrap.servers = localhost:9092
a1.sinks.k1.kafka.flumeBatchSize = 20
a1.sinks.k1.kafka.producer.acks = 1
a1.sinks.k1.kafka.producer.linger.ms = 1
a1.sinks.k1.kafka.producer.compression.type = snappy

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 100000
a1.channels.c1.transactionCapacity = 100000

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
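Note that the memory channel trades durability for speed: events still buffered in the channel are lost if the agent dies. If that matters for your logs, a file channel is a drop-in alternative (the paths here are illustrative):

a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /usr/local/flume-1.9.0/data/checkpoint
a1.channels.c1.dataDirs = /usr/local/flume-1.9.0/data/file-channel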
From the Flume installation directory, start the agent:
nohup bin/flume-ng agent --conf conf --conf-file conf/flume-ad-test.conf --name a1 &
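Before wiring up Druid, it's worth confirming that events are actually reaching the topic. Assuming a standard Kafka installation, tail the topic from the Kafka directory:

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic ad-test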
Once the Kafka ingestion is set up, the datasource will show up under the Query tab of the Druid web console.