Today I was writing a Flume program, and after a series of configuration steps it threw the exception below, which crashed Flume on startup:
must not generate more than one output value per record field
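I searched online, but only found posts that pasted the official documentation, which did not help me understand the problem.
Below is the official description of MorphlineInterceptor. The gist is that MorphlineInterceptor currently has one restriction: the morphline of an interceptor must not generate more than one output record for each input event.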
This interceptor filters the events through a morphline configuration file that defines a chain of transformation commands that pipe records from one command to another. For example the morphline can ignore certain events or alter or insert certain event headers via regular expression based pattern matching, or it can auto-detect and set a MIME type via Apache Tika on events that are intercepted. For example, this kind of packet sniffing can be used for content based dynamic routing in a Flume topology. MorphlineInterceptor can also help to implement dynamic routing to multiple Apache Solr collections (e.g. for multi-tenancy).
Currently, there is a restriction in that the morphline of an interceptor must not generate more than one output record for each input event. This interceptor is not intended for heavy duty ETL processing - if you need this consider moving ETL processing from the Flume Source to a Flume Sink, e.g. to a MorphlineSolrSink.
Here is my interceptor configuration:
morphlines : [
  {
    id : 2map
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    commands : [
      {
        readLine {
          charset : UTF-8
        }
      }
      {
        findReplace {
          field : message
          pattern : "\""
          isRegex : true
          replacement : ""
          replaceFirst : false
        }
      }
      {
        split {
          inputField : message
          outputFields : [RB040002,RB060002,'','',RB060003]
          separator : "@@separator@@"
          isRegex : false
          addEmptyStrings : true
          trim : true
        }
      }
      {
        setValues {
          dataset : "RWA_BASIC_Z002_1111"
          htable : "bcpdata5_struct"
          RZ002442 : "@{RB060003}"
          namespace : "EXT003"
          message : []
          _attachment_body : []
        }
      }
    ]
  }
]
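Since the official description did not make things clear to me, I downloaded the source and stepped through it. The morphline split command splits the incoming event and stores the values under the outputFields names in a Map. In the toEvent() method of MorphlineInterceptor.java (line 175), I saw that the entry produced by split looked like null:{null,null}, that is, one field holding two values. That is where things go wrong, because the method contains this check: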
if (entry.getValue().size() > 1) {
  throw new FlumeException(getClass().getName()
      + " must not generate more than one output value per record field");
}
So this is where the exception is thrown. The next question is why the record ends up with values like that.
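To make the check easier to see in isolation, here is a minimal sketch of the same idea in plain Java. It is not Flume's code: the class name RecordFieldCheck and the method toHeaders are hypothetical, the record is modeled as a plain Map of lists, and IllegalStateException stands in for FlumeException so the sketch has no dependencies.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the size check in MorphlineInterceptor.toEvent().
// Names are illustrative; IllegalStateException replaces FlumeException.
final class RecordFieldCheck {

    // Convert a multi-valued record into single-valued headers,
    // rejecting any field that holds more than one value.
    static Map<String, String> toHeaders(Map<String, List<Object>> record) {
        Map<String, String> headers = new LinkedHashMap<>();
        for (Map.Entry<String, List<Object>> entry : record.entrySet()) {
            if (entry.getValue().size() > 1) {
                throw new IllegalStateException(
                    "must not generate more than one output value per record field: "
                        + entry.getKey());
            }
            if (!entry.getValue().isEmpty()) {
                headers.put(entry.getKey(), String.valueOf(entry.getValue().get(0)));
            }
        }
        return headers;
    }

    public static void main(String[] args) {
        Map<String, List<Object>> record = new LinkedHashMap<>();
        record.computeIfAbsent("RB040002", k -> new ArrayList<>()).add("a");
        // Two columns written under the same field name: this trips the check.
        record.computeIfAbsent("''", k -> new ArrayList<>()).add("c");
        record.computeIfAbsent("''", k -> new ArrayList<>()).add("d");
        try {
            toHeaders(record);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The point of the sketch is only that a record field is a list of values, while an event header holds a single value, so any field with two values cannot be converted.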
Time to look at how morphlines are supposed to be used:
http://cloudera.github.io/cdk/docs/current/cdk-morphlines/morphlinesReferenceGuide.html
The split command is documented as follows:
Property Name | Default | Description |
---|---|---|
outputFields | null | The names of the fields to add output values to, i.e. a list of strings. Example: [firstName, lastName, "", age]. An empty string in a list indicates omit this column in the output. One of outputField or outputFields must be present, but not both. |
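So that was the cause: to skip a column, the placeholder must be an empty string written with double quotes (""), but I had written it with single quotes ('').
I changed the outputFields line in the configuration above to: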
outputFields : [RB040002,RB060002,"","",RB060003]
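As a rough illustration of why the quoting matters, the sketch below models how split columns might be assigned to output fields. It is not Kite's actual code: SplitSketch and assign are hypothetical names, and it assumes that the config parser reads '' as an ordinary two-character field name rather than as an empty string, which matches the behavior I saw but is an assumption here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative model of assigning split columns to output fields.
// Not Kite's implementation; names are hypothetical.
final class SplitSketch {

    // An empty field name means "omit this column"; any other name
    // collects the column's value under that field.
    static Map<String, List<String>> assign(List<String> outputFields, String[] columns) {
        Map<String, List<String>> record = new LinkedHashMap<>();
        for (int i = 0; i < outputFields.size() && i < columns.length; i++) {
            String name = outputFields.get(i);
            if (name.isEmpty()) {
                continue; // "" in the config: skip the column
            }
            record.computeIfAbsent(name, k -> new ArrayList<>()).add(columns[i]);
        }
        return record;
    }

    public static void main(String[] args) {
        String[] columns = {"a", "b", "c", "d", "e"};
        // Assumption: '' in the config arrives as the literal two-character name "''".
        List<String> singleQuoted = Arrays.asList("RB040002", "RB060002", "''", "''", "RB060003");
        List<String> doubleQuoted = Arrays.asList("RB040002", "RB060002", "", "", "RB060003");
        System.out.println(assign(singleQuoted, columns));
        System.out.println(assign(doubleQuoted, columns));
    }
}
```

Under that assumption, the single-quoted list puts both skipped columns under one field named '', giving that field two values, while the double-quoted list drops them and leaves every field with exactly one value.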
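OK, problem solved. As I understand it now, the restriction means that a single input event must not end up with more than one value under the same output field.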