这里以按自定义头部的配置为例(根据某些业务不同写入不同的主目录)
配置:
source:
interceptors = i1
interceptors.i1.type = regex_extractor
interceptors.i1.regex = /apps/logs/(.*?)/
interceptors.i1.serializers = s1
interceptors.i1.serializers.s1.name = logtypename
sink:
hdfs.path = hdfs://xxxxxx/%{logtypename}/%Y%m%d/%H
hdfs.round = true
hdfs.roundValue = 30
hdfs.roundUnit = minute
hdfs.filePrefix = xxxxx1-
在source中定义了regex_extractor 类型的interceptor,使用org.apache.flume.interceptor.RegexExtractorInterceptor类构建interceptor对象,这个interceptor可以根据一个正则表达式提取字符串,并使用serializers把字符串作为header的值,这header可以在sink中获取对应的值做进一步的操作.
比如写hdfs的sink HDFSEventSink的process方法中
// reconstruct the path name by substituting place holders
String realPath = BucketPath.escapeString(filePath, event.getHeaders(),
timeZone, needRounding, roundUnit , roundValue , useLocalTime );
String realName = BucketPath.escapeString(fileName, event.getHeaders(),
timeZone, needRounding, roundUnit , roundValue , useLocalTime );
几个参数项:
useLocalTime 是hdfs.useLocalTimeStamp的设置,默认是false
filePath为hdfs.path的设置,不能为空
fileName为hdfs.filePrefix的设置,默认为FlumeData
rounding(取近似值)的设置相关:
needRounding = context.getBoolean( "hdfs.round", false );
//hdfs.round的设置,默认为false
if(needRounding) {
String unit = context.getString( "hdfs.roundUnit", "second" );
//hdfs.roundUnit,默认为second
if (unit.equalsIgnoreCase( "hour")) {
this.roundUnit = Calendar.HOUR_OF_DAY;
} else if (unit.equalsIgnoreCase("minute" )) {
this.roundUnit = Calendar.MINUTE;
} else if (unit.equalsIgnoreCase("second" )){
this.roundUnit = Calendar.SECOND;
} else {
LOG.warn("Rounding unit is not valid, please set one of" +
"minute, hour, or second. Rounding will be disabled" );
needRounding = false ;
}
this.roundValue = context.getInteger("hdfs.roundValue" , 1);
//hdfs.roundValue值的设置,默认为1
if(roundUnit == Calendar. SECOND || roundUnit == Calendar.MINUTE){
//下面个为检测roundValue的值是否设置合理
Preconditions.checkArgument(roundValue > 0 && roundValue <= 60,
"Round value" +
"must be > 0 and <= 60");
} else if (roundUnit == Calendar.HOUR_OF_DAY){
Preconditions.checkArgument(roundValue > 0 && roundValue <= 24,
"Round value" +
"must be > 0 and <= 24");
}
}
hdfs的具体路径主要由org.apache.flume.formatter.output.BucketPath类的escapeString方法实现
BucketPath类方法分析:
1.escapeString用于替换%{yyy}的设置和%x的设置,需要设置为%x或者%{yyy}的形式,yyy可以是单词字符,和.或者-其调用replaceShorthand
final public static String TAG_REGEX = "\\%(\\w|\\%)|\\%\\{([\\w\\.-]+)\\}";
//正则表达式
final public static Pattern tagPattern = Pattern.compile(TAG_REGEX);
....
public static String escapeString(String in, Map<String, String> headers,
TimeZone timeZone, boolean needRounding, int unit, int roundDown,
boolean useLocalTimeStamp) {
long ts = clock.currentTimeMillis(); //获取当前的时间戳
Matcher matcher = tagPattern.matcher(in);
//对输入的字符串进行matcher操作,返回Matcher对象,比如这里in可以是hdfs.path的设置
StringBuffer sb = new StringBuffer();
while (matcher.find()) {
//用于查看字符串中是否有子字符串可以匹配正则表达式,有的话进入循环中
String replacement = "";
if (matcher.group(2) != null) { //匹配%{...}的设置
replacement = headers.get(matcher.group(2)); //获取对应的header值
if (replacement == null) {
replacement = "";
}
} else { //匹配%x的设置
Preconditions.checkState(matcher.group(1) != null
&& matcher.group(1).length() == 1,
"Expected to match single character tag in string " + in);
char c = matcher.group(1).charAt(0);
replacement = replaceShorthand(c, headers, timeZone,
needRounding, unit, roundDown, useLocalTimeStamp, ts);
//对字符调用replaceShorthand方法
}
replacement = replacement.replaceAll("\\\\", "\\\\\\\\");
replacement = replacement.replaceAll("\\$", "\\\\\\$");
matcher.appendReplacement(sb, replacement);
}
matcher.appendTail(sb);
return sb.toString(); //返回字符串
}
2.replaceShorthand方法用于根据timestamp header的值和round的设置以及路径设置返回对应的日期字符串(比如%Y生成yyyyy(20150310)形式的日期),会调用roundDown方法
protected static String replaceShorthand(char c, Map<String, String> headers,
TimeZone timeZone, boolean needRounding, int unit, int roundDown,
boolean useLocalTimestamp, long ts) {
String timestampHeader = null;
try {
if(!useLocalTimestamp) { //hdfs.useLocalTimeStamp设置为false时(默认)
timestampHeader = headers.get("timestamp"); //获取timestamp的值
Preconditions.checkNotNull(timestampHeader, "Expected timestamp in " +
"the Flume event headers, but it was null");
//检测timestamp header的值是否为空
ts = Long.valueOf(timestampHeader);
} else {
timestampHeader = String.valueOf(ts);
}
}
...
if(needRounding){ //如果hdfs.round设置为true(默认为false)
ts = roundDown(roundDown, unit, ts); //调用roundDown向下取整,生成新的ts
}
// It's a date
String formatString = "";
switch (c) { //对字符串进行匹配,生成日期格式,比如%Y%m%d 最后生成的日期格式为yyyyMMdd
case '%':
return "%";
case 'a':
formatString = "EEE";
break;
.....
case 'z':
formatString = "ZZZ";
break;
default:
// LOG.warn("Unrecognized escape in event format string: %" + c);
return "";
}
SimpleDateFormat format = new SimpleDateFormat(formatString);
//根据格式生成SimpleDateFormat对象
if (timeZone != null) {
format.setTimeZone(timeZone);
}
Date date = new Date(ts); //由ts生成Date对象
return format.format(date); //根据Date对象生成时间字符串
}
3.roundDown用于向下取整
private static long roundDown(int roundDown, int unit, long ts){
long timestamp = ts;
if(roundDown <= 0){
roundDown = 1;
}
switch (unit) {
case Calendar. SECOND:
timestamp = TimestampRoundDownUtil. roundDownTimeStampSeconds(
ts, roundDown); //如果hdfs.roundUnit是second调用TimestampRoundDownUtil.roundDownTimeStampSeconds方法
break;
....
default:
timestamp = ts;
break;
}
return timestamp;
}
转载于:https://blog.51cto.com/caiguangguang/1619539