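Learning Flume (9): Custom Interceptors

We return to the requirement from part 8, but this time implement it a different way: with an interceptor.

Recall that the spooldir source can write the file name into each event's header under the key basename. Suppose an interceptor could intercept the event, read the value of that key, split it into three segments, and put each segment into the header; that would satisfy the requirement.

Unfortunately, Flume does not provide an interceptor that extracts content from a header. It does have one that extracts content from the body, RegexExtractorInterceptor, which looks quite capable. Here is an example from the official documentation: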

If the Flume event body contained 1:2:3.4foobar5 and the following configuration was used


a1.sources.r1.interceptors.i1.regex = (\\d):(\\d):(\\d)
a1.sources.r1.interceptors.i1.serializers = s1 s2 s3
a1.sources.r1.interceptors.i1.serializers.s1.name = one
a1.sources.r1.interceptors.i1.serializers.s2.name = two
a1.sources.r1.interceptors.i1.serializers.s3.name = three
The extracted event will contain the same body but the following headers will have been added one=>1, two=>2, three=>3

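In other words, with this configuration, if the event body contains 1:2:3.4foobar5, the capture groups of the regular expression pick out the matching parts, and each part is written to the header under the configured name.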
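To make the mechanism concrete, here is a minimal stand-alone sketch that reproduces the documentation example with nothing but java.util.regex. It does not use any Flume classes; the class name RegexExtractDemo and the method extract are made up for illustration, and a plain list of names stands in for the serializers s1, s2, s3.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Flume: mimics how capture groups
// are mapped to header names.
class RegexExtractDemo {

    // Capture group i (1-based) is stored under names.get(i - 1).
    // Groups without a corresponding name are skipped.
    static Map<String, String> extract(String regex, List<String> names, String input) {
        Map<String, String> headers = new LinkedHashMap<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        if (m.find()) {
            int n = Math.min(m.groupCount(), names.size());
            for (int i = 1; i <= n; i++) {
                headers.put(names.get(i - 1), m.group(i));
            }
        }
        return headers;
    }

    public static void main(String[] args) {
        Map<String, String> headers = extract("(\\d):(\\d):(\\d)",
                Arrays.asList("one", "two", "three"), "1:2:3.4foobar5");
        System.out.println(headers); // prints {one=1, two=2, three=3}
    }
}
```

The body itself is left untouched; only the header map gains entries, which matches the behavior described in the documentation excerpt.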


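So I set my sights on this interceptor: it seemed that a small change, matching against a specific header key instead of the body, would be enough. The source code turned out to be tidy and easy to modify. Here is the interceptor I added, RegexExtractorExtInterceptor: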

package com.besttone.flume;

import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.lang.StringUtils;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;
import org.apache.flume.interceptor.RegexExtractorInterceptorPassThroughSerializer;
import org.apache.flume.interceptor.RegexExtractorInterceptorSerializer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.google.common.base.Charsets;
import com.google.common.base.Preconditions;
import com.google.common.base.Throwables;
import com.google.common.collect.Lists;

/**
 * Interceptor that extracts matches using a specified regular expression and
 * appends the matches to the event headers using the specified serializers</p>
 * Note that all regular expression matching occurs through Java's built in
 * java.util.regex package</p>. Properties:
 * <p>
 * regex: The regex to use
 * <p>
 * serializers: Specifies the group the serializer will be applied to, and the
 * name of the header that will be added. If no serializer is specified for a
 * group the default {@link RegexExtractorInterceptorPassThroughSerializer} will
 * be used
 * <p>
 * Sample config:
 * <p>
 * agent.sources.r1.channels = c1
 * <p>
 * agent.sources.r1.type = SEQ
 * <p>
 * agent.sources.r1.interceptors = i1
 * <p>
 * agent.sources.r1.interceptors.i1.type = REGEX_EXTRACTOR
 * <p>
 * agent.sources.r1.interceptors.i1.regex = (WARNING)|(ERROR)|(FATAL)
 * <p>
 * agent.sources.r1.interceptors.i1.serializers = s1 s2
 * agent.sources.r1.interceptors.i1.serializers.s1.type =
 * com.blah.SomeSerializer agent.sources.r1.interceptors.i1.serializers.s1.name
 * = warning agent.sources.r1.interceptors.i1.serializers.s2.type =
 * org.apache.flume.interceptor.RegexExtractorInterceptorTimestampSerializer
 * agent.sources.r1.interceptors.i1.serializers.s2.name = error
 * agent.sources.r1.interceptors.i1.serializers.s2.dateFormat = yyyy-MM-dd
 * </code>
 * </p>
 * 
 * <pre>
 * Example 1:
 * </p>
 * EventBody: 1:2:3.4foobar5</p> Configuration:
 * agent.sources.r1.interceptors.i1.regex = (\\d):(\\d):(\\d)
 * </p>
 * agent.sources.r1.interceptors.i1.serializers = s1 s2 s3
 * agent.sources.r1.interceptors.i1.serializers.s1.name = one
 * agent.sources.r1.interceptors.i1.serializers.s2.name = two
 * agent.sources.r1.interceptors.i1.serializers.s3.name = three
 * </p>
 * results in an event with the following
 * 
 * body: 1:2:3.4foobar5 headers: one=>1, two=>2, three=>3
 * 
 * Example 2:
 * 
 * EventBody: 1:2:3.4foobar5
 * 
 * Configuration: agent.sources.r1.interceptors.i1.regex = (\\d):(\\d):(\\d)
 * <p>
 * agent.sources.r1.interceptors.i1.serializers = s1 s2
 * agent.sources.r1.interceptors.i1.serializers.s1.name = one
 * agent.sources.r1.interceptors.i1.serializers.s2.name = two
 * <p>
 * 
 * results in an event with the following
 * 
 * body: 1:2:3.4foobar5 headers: one=>1, two=>2
 * </pre>
 */
public class RegexExtractorExtInterceptor implements Interceptor {

	static final String REGEX = "regex";
	static final String SERIALIZERS = "serializers";

	// Added code: start

	static final String EXTRACTOR_HEADER = "extractorHeader";
	static final boolean DEFAULT_EXTRACTOR_HEADER = false;
	static final String EXTRACTOR_HEADER_KEY = "extractorHeaderKey";

	// Added code: end

	private static final Logger logger = LoggerFactory
			.getLogger(RegexExtractorExtInterceptor.class);

	private final Pattern regex;
	private final List<NameAndSerializer> serializers;

	// Added code: start

	private final boolean extractorHeader;
	private final String extractorHeaderKey;

	// Added code: end

	private RegexExtractorExtInterceptor(Pattern regex,
			List<NameAndSerializer> serializers, boolean extractorHeader,
			String extractorHeaderKey) {
		this.regex = regex;
		this.serializers = serializers;
		this.extractorHeader = extractorHeader;
		this.extractorHeaderKey = extractorHeaderKey;
	}

	@Override
	public void initialize() {
		// NO-OP...
	}

	@Override
	public void close() {
		// NO-OP...
	}

	@Override
	public Event intercept(Event event) {
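		// Choose the text to match: a header value or the event body.
		String tmpStr;
		if (extractorHeader) {
			tmpStr = event.getHeaders().get(extractorHeaderKey);
			if (tmpStr == null) {
				// Header absent: nothing to extract, so pass the event
				// through instead of hitting a NullPointerException below.
				return event;
			}
		} else {
			tmpStr = new String(event.getBody(), Charsets.UTF_8);
		}

		Matcher matcher = regex.matcher(tmpStr);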
		Map<String, String> headers = event.getHeaders();
		if (matcher.find()) {
			for (int group = 0, count = matcher.groupCount(); group < count; group++) {
				int groupIndex = group + 1;
				if (groupIndex > serializers.size()) {
					if (logger.isDebugEnabled()) {
						logger.debug(
								"Skipping group {} to {} due to missing serializer",
								group, count);
					}
					break;
				}
				NameAndSerializer serializer = serializers.get(group);
				if (logger.isDebugEnabled()) {
					logger.debug("Serializing {} using {}",
							serializer.headerName, serializer.serializer);
				}
				headers.put(serializer.headerName, serializer.serializer
						.serialize(matcher.group(groupIndex)));
			}
		}
		return event;
	}

	@Override
	public List<Event> intercept(List<Event> events) {
		List<Event> intercepted = Lists.newArrayListWithCapacity(events.size());
		for (Event event : events) {
			Event interceptedEvent = intercept(event);
			if (interceptedEvent != null) {
				intercepted.add(interceptedEvent);
			}
		}
		return intercepted;
	}

	public static class Builder implements Interceptor.Builder {

		private Pattern regex;
		private List<NameAndSerializer> serializerList;

		// Added code: start

		private boolean extractorHeader;
		private String extractorHeaderKey;

		// Added code: end

		private final RegexExtractorInterceptorSerializer defaultSerializer = new RegexExtractorInterceptorPassThroughSerializer();

		@Override
		public void configure(Context context) {
			String regexString = context.getString(REGEX);
			Preconditions.checkArgument(!StringUtils.isEmpty(regexString),
					"Must supply a valid regex string");

			regex = Pattern.compile(regexString);
			regex.pattern();
			regex.matcher("").groupCount();
			configureSerializers(context);

			// Added code: start
			extractorHeader = context.getBoolean(EXTRACTOR_HEADER,
					DEFAULT_EXTRACTOR_HEADER);

			if (extractorHeader) {
				extractorHeaderKey = context.getString(EXTRACTOR_HEADER_KEY);
				Preconditions.checkArgument(
						!StringUtils.isEmpty(extractorHeaderKey),
						"Must supply the header key to extract from");
			}
			// Added code: end
		}

		private void configureSerializers(Context context) {
			String serializerListStr = context.getString(SERIALIZERS);
			Preconditions.checkArgument(
					!StringUtils.isEmpty(serializerListStr),
					"Must supply at least one name and serializer");

			String[] serializerNames = serializerListStr.split("\\s+");

			Context serializerContexts = new Context(
					context.getSubProperties(SERIALIZERS + "."));

			serializerList = Lists
					.newArrayListWithCapacity(serializerNames.length);
			for (String serializerName : serializerNames) {
				Context serializerContext = new Context(
						serializerContexts.getSubProperties(serializerName
								+ "."));
				String type = serializerContext.getString("type", "DEFAULT");
				String name = serializerContext.getString("name");
				Preconditions.checkArgument(!StringUtils.isEmpty(name),
						"Supplied name cannot be empty.");

				if ("DEFAULT".equals(type)) {
					serializerList.add(new NameAndSerializer(name,
							defaultSerializer));
				} else {
					serializerList.add(new NameAndSerializer(name,
							getCustomSerializer(type, serializerContext)));
				}
			}
		}

		private RegexExtractorInterceptorSerializer getCustomSerializer(
				String clazzName, Context context) {
			try {
				RegexExtractorInterceptorSerializer serializer = (RegexExtractorInterceptorSerializer) Class
						.forName(clazzName).newInstance();
				serializer.configure(context);
				return serializer;
			} catch (Exception e) {
				logger.error("Could not instantiate event serializer.", e);
				Throwables.propagate(e);
			}
			return defaultSerializer;
		}

		@Override
		public Interceptor build() {
			Preconditions.checkArgument(regex != null,
					"Regex pattern was misconfigured");
			Preconditions.checkArgument(serializerList.size() > 0,
					"Must supply a valid group match id list");
			return new RegexExtractorExtInterceptor(regex, serializerList,
					extractorHeader, extractorHeaderKey);
		}
	}

	static class NameAndSerializer {
		private final String headerName;
		private final RegexExtractorInterceptorSerializer serializer;

		public NameAndSerializer(String headerName,
				RegexExtractorInterceptorSerializer serializer) {
			this.headerName = headerName;
			this.serializer = serializer;
		}
	}
}

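A brief summary of the changes:

Two configuration parameters were added:

extractorHeader   Whether to extract from the header. Defaults to false, in which case the interceptor behaves like the original one and extracts from the event body.

extractorHeaderKey The header key whose value is matched against the regex. Required when extractorHeader is true.

Following the approach from part 8, we package the class as a jar and place it, as a Flume plugin, under /var/lib/flume-ng/plugins.d/RegexExtractorExtInterceptor/lib, then restart Flume so that the interceptor is loaded onto the classpath.

The final flume.conf is as follows: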
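The effect of the two parameters can be sketched without Flume on the classpath. The snippet below is only an illustration: ExtractSourceDemo and selectInput are hypothetical names, and a plain Map and byte array stand in for the headers and body of a Flume Event. A header that is absent yields null, which the caller has to check before matching, because Pattern.matcher(null) throws a NullPointerException.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class, not part of Flume or of the interceptor itself.
class ExtractSourceDemo {

    // Returns the text the regex should run against:
    // the value of the given header key when extractorHeader is true,
    // otherwise the body decoded as UTF-8.
    static String selectInput(boolean extractorHeader, String extractorHeaderKey,
            Map<String, String> headers, byte[] body) {
        if (extractorHeader) {
            return headers.get(extractorHeaderKey); // null when the key is absent
        }
        return new String(body, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        headers.put("basename", "a.log.2014-07-31");
        byte[] body = "some log line".getBytes(StandardCharsets.UTF_8);

        System.out.println(selectInput(true, "basename", headers, body));  // prints a.log.2014-07-31
        System.out.println(selectInput(false, null, headers, body));       // prints some log line
    }
}
```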

tier1.sources=source1
tier1.channels=channel1
tier1.sinks=sink1
tier1.sources.source1.type=spooldir
tier1.sources.source1.spoolDir=/opt/logs
tier1.sources.source1.fileHeader=true
tier1.sources.source1.basenameHeader=true
tier1.sources.source1.interceptors=i1
tier1.sources.source1.interceptors.i1.type=com.besttone.flume.RegexExtractorExtInterceptor$Builder
tier1.sources.source1.interceptors.i1.regex=(.*)\\.(.*)\\.(.*)
tier1.sources.source1.interceptors.i1.extractorHeader=true
tier1.sources.source1.interceptors.i1.extractorHeaderKey=basename
tier1.sources.source1.interceptors.i1.serializers=s1 s2 s3
tier1.sources.source1.interceptors.i1.serializers.s1.name=one
tier1.sources.source1.interceptors.i1.serializers.s2.name=two
tier1.sources.source1.interceptors.i1.serializers.s3.name=three
tier1.sources.source1.channels=channel1
tier1.sinks.sink1.type=hdfs
tier1.sinks.sink1.channel=channel1
tier1.sinks.sink1.hdfs.path=hdfs://master68:8020/flume/events/%{one}/%{three}
tier1.sinks.sink1.hdfs.round=true
tier1.sinks.sink1.hdfs.roundValue=10
tier1.sinks.sink1.hdfs.roundUnit=minute
tier1.sinks.sink1.hdfs.fileType=DataStream
tier1.sinks.sink1.hdfs.writeFormat=Text
tier1.sinks.sink1.hdfs.rollInterval=0
tier1.sinks.sink1.hdfs.rollSize=10240
tier1.sinks.sink1.hdfs.rollCount=0
tier1.sinks.sink1.hdfs.idleTimeout=60
tier1.channels.channel1.type=memory
tier1.channels.channel1.capacity=10000
tier1.channels.channel1.transactionCapacity=1000
tier1.channels.channel1.keep-alive=30

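I switched the source type back to the built-in spooldir instead of the custom source from the previous part, and added an interceptor i1 whose type is the custom interceptor com.besttone.flume.RegexExtractorExtInterceptor$Builder. The regular expression splits the value on "." into three parts, which are written to the header under the keys one, two and three. For a file named a.log.2014-07-31, the interceptor adds three keys to the header: one=a, two=log, three=2014-07-31. The sink then uses them through tier1.sinks.sink1.hdfs.path=hdfs://master68:8020/flume/events/%{one}/%{three}, so the output directory is built from those values.

This achieves exactly the same result as in part 8.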
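The splitting and the resulting directory can be checked with a small stand-alone sketch. It uses only java.util.regex; BasenameSplitDemo, split and resolve are hypothetical names, and resolve is a deliberately simplified stand-in for the sink's %{header} substitution (plain string replacement of the two placeholders used here), not Flume's actual implementation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Flume.
class BasenameSplitDemo {

    // Same regex as in flume.conf; in a properties file it is written
    // with doubled backslashes, just like in this Java string literal.
    static final Pattern REGEX = Pattern.compile("(.*)\\.(.*)\\.(.*)");

    // Returns {one, two, three}, or null when the name has fewer than two dots.
    static String[] split(String basename) {
        Matcher m = REGEX.matcher(basename);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    // Simplified placeholder substitution for %{one} and %{three} only.
    static String resolve(String template, String[] parts) {
        return template.replace("%{one}", parts[0]).replace("%{three}", parts[2]);
    }

    public static void main(String[] args) {
        String[] parts = split("a.log.2014-07-31");
        System.out.println(resolve(
                "hdfs://master68:8020/flume/events/%{one}/%{three}", parts));
        // prints hdfs://master68:8020/flume/events/a/2014-07-31
    }
}
```

One thing worth knowing about this regex: (.*) is greedy, so for a name with more than two dots the extra segments end up in the first group. For example, app.server.log.2014-07-31 gives one=app.server, two=log, three=2014-07-31.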


As you can see, a custom interceptor is a much smaller change than a custom source: we added a single class and the feature was done.

In Flume, custom interceptors are a powerful feature for adding, modifying, or removing event headers as data flows through an agent. They are typically used for log processing, data filtering, and other cases where events must be handled according to specific conditions. The steps to create one are as follows:

### 1. Create the custom interceptor class

First, create a class that implements the `org.apache.flume.interceptor.Interceptor` interface. The class must implement the `initialize`, `intercept`, and `close` methods.

```java
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.util.List;
import java.util.Map;

public class CustomInterceptor implements Interceptor {

    @Override
    public void initialize() {
        // Initialization
    }

    @Override
    public Event intercept(Event event) {
        Map<String, String> headers = event.getHeaders();
        String body = new String(event.getBody());
        // Modify the headers or the event content as needed
        headers.put("customHeader", "customValue");
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        for (Event event : events) {
            intercept(event);
        }
        return events;
    }

    @Override
    public void close() {
        // Cleanup
    }

    public static class Builder implements Interceptor.Builder {

        @Override
        public Interceptor build() {
            return new CustomInterceptor();
        }

        @Override
        public void configure(Context context) {
            // Configuration
        }
    }
}
```

### 2. Package the interceptor

Package the custom interceptor class into a JAR file and place it in Flume's `lib` directory.

### 3. Configure Flume

Add the custom interceptor to the Flume configuration file. For example:

```properties
agent.sources = source1
agent.sinks = sink1
agent.channels = channel1

agent.sources.source1.type = exec
agent.sources.source1.command = tail -F /var/log/your_log_file.log
agent.sources.source1.interceptors = i1
agent.sources.source1.interceptors.i1.type = com.yourpackage.CustomInterceptor$Builder
agent.sources.source1.channels = channel1

agent.sinks.sink1.channel = channel1
agent.sinks.sink1.type = logger

agent.channels.channel1.type = memory
```

### 4. Start Flume

Start Flume and check that the interceptor works as expected.

```sh
flume-ng agent --conf /path/to/conf --conf-file /path/to/flume.conf --name agent -Dflume.root.logger=INFO,console
```

With these steps you can create a custom interceptor and use it in Flume to process the data stream.