Flume, Part 1: Writing JSON data from Flume into Kudu (flume-kudu-sink)
The same approach works for non-JSON data: an interceptor can assemble the non-JSON payload into a single JSON object and send that downstream instead.
Without further ado, here is the how-to.
Part 1. Custom interceptor
1 Interceptor requirements: create a brand-new project and package it on its own, so that every Flume interceptor ships as its own jar and changing one interceptor never affects other Flume pipelines. You can also split plain string or CSV data into an array and reassemble it into a JSON object before sending it on; that works just as well (a minimal sketch of that variant follows the note below).
(Personal preference: I like doing this work in an interceptor. Strictly speaking it overlaps with the custom JsonKuduOperationsProducer further down, so the two are somewhat redundant and one of them could be dropped.)
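As a reference for the CSV variant mentioned above, here is a minimal sketch of an intercept() method that turns a comma-separated line into a JSON body. It assumes the same imports as the interceptor class shown further down; the field names (id, name, age, time) and the comma delimiter are assumptions for illustration only, not taken from any real feed:
// Hypothetical sketch: convert a CSV line "id,name,age,time" into a JSON event body.
// Field names and delimiter are assumed purely for illustration.
@Override
public Event intercept(Event event) {
    String body = new String(event.getBody());
    String[] parts = body.split(",", -1);       // split the CSV line into fields
    if (parts.length < 4) {                     // drop lines that do not have enough fields
        return null;
    }
    JSONObject json = new JSONObject();         // com.alibaba.fastjson.JSONObject
    json.put("id", parts[0]);
    json.put("name", parts[1]);
    json.put("age", parts[2]);
    json.put("time", parts[3]);
    event.setBody(json.toString().getBytes());  // send the assembled JSON downstream instead of the CSV
    return event;
}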
The interceptor pom is as follows (it pulls in more dependencies than this interceptor strictly needs; prune the ones you don't use):
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<maven.compiler.source>1.7</maven.compiler.source>
<maven.compiler.target>1.7</maven.compiler.target>
<scala.version>2.10.4</scala.version>
<flume.version>1.8.0</flume.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.flume</groupId>
<artifactId>flume-ng-core</artifactId>
<version>${flume.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>commons-net</groupId>
<artifactId>commons-net</artifactId>
<version>3.3</version>
</dependency>
<dependency>
<groupId>commons-io</groupId>
<artifactId>commons-io</artifactId>
<version>2.4</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.testng</groupId>
<artifactId>testng</artifactId>
<version>6.1.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.carbondata</groupId>
<artifactId>carbondata-store-sdk</artifactId>
<version>1.5.3</version>
</dependency>
<dependency>
<groupId>org.apache.carbondata</groupId>
<artifactId>carbondata-core</artifactId>
<version>1.5.3</version>
</dependency>
<dependency>
<groupId>org.apache.carbondata</groupId>
<artifactId>carbondata-common</artifactId>
<version>1.5.3</version>
</dependency>
<dependency>
<groupId>org.apache.carbondata</groupId>
<artifactId>carbondata-format</artifactId>
<version>1.5.3</version>
</dependency>
<dependency>
<groupId>org.apache.carbondata</groupId>
<artifactId>carbondata-hadoop</artifactId>
<version>1.5.3</version>
</dependency>
<dependency>
<groupId>org.apache.carbondata</groupId>
<artifactId>carbondata-processing</artifactId>
<version>1.5.3</version>
</dependency>
<dependency>
<groupId>org.apache.carbondata</groupId>
<artifactId>carbonata</artifactId>
<version>1.5.3</version>
<scope>system</scope>
<systemPath>${project.basedir}/lib/apache-carbondata-1.5.3-bin-spark2.3.2-hadoop2.6.0-cdh5.16.1.jar</systemPath>
</dependency>
<dependency>
<groupId>org.apache.mina</groupId>
<artifactId>mina-core</artifactId>
<version>2.0.9</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.mockito</groupId>
<artifactId>mockito-all</artifactId>
<version>1.9.5</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.sshd</groupId>
<artifactId>sshd-core</artifactId>
<version>0.14.0</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>com.jcraft</groupId>
<artifactId>jsch</artifactId>
<version>0.1.54</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.7.12</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.2.47</version>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
<version>3.5</version>
</dependency>
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>16.0.1</version>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>0.11.0.0</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.46</version>
<scope>compile</scope>
</dependency>
</dependencies>
2 The interceptor code is as follows:
(Its purpose: pull the individual fields out of a JSON payload whose body is nested two levels deep, and rename them.)
package com.iflytek.extracting.flume.interceptor;
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import com.google.common.collect.Lists;
import com.google.common.collect.Maps;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.text.SimpleDateFormat;
import java.util.List;
import java.util.Map;
public class XyJsonInterceptorTC implements Interceptor {
private static final Logger logger = LoggerFactory.getLogger(XyJsonInterceptorTC.class);
private SimpleDateFormat dataFormat;
@Override
public void initialize() {
dataFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
}
@Override
public Event intercept(Event event) {
String body = new String(event.getBody());
try {
JSONObject jsonObject = JSON.parseObject(body); // parse the string body into a JSON object
JSONObject bodyObject1 = jsonObject.getJSONObject("body"); // first-level "body"
JSONObject bodyObject2 = bodyObject1.getJSONObject("body"); // second-level "body"
JSONObject resObject = new JSONObject(); // new JSON object holding the fields extracted from the nested body
getPut(resObject, bodyObject2, "id", "newId"); // copy "id" into the new JSON, renamed to "newId"
getPut(resObject, bodyObject2, "name", "newName");
getPut(resObject, bodyObject2, "age", "newAge");
getPutDate(resObject, bodyObject2, "time", "newTime", dataFormat); // convert the epoch-millisecond timestamp into a formatted date string
logger.info("Interceptor output resObject: " + resObject.toString());
event.setBody(resObject.toString().getBytes());
return event;
} catch (Exception e) {
logger.warn("Malformed record: " + body); // log records that do not match the expected format so they can be replayed from the log later; the try/catch is essential, otherwise a stream of bad records can bring the Flume agent down
return null;
}
}
@Override
public List<Event> intercept(List<Event> events) {
List<Event> resultList = Lists.newArrayList();
for (Event event : events) {
Event result = intercept(event);
if (result != null) {
resultList.add(result);
}
}
return resultList;
}
@Override
public void close() {
}
public static class Builder implements Interceptor.Builder {
@Override
public Interceptor build() {
return new XyJsonInterceptorTC();
}
@Override
public void configure(Context context) {
}
}
public static void getPut(JSONObject resObject, JSONObject jsonObject, String oldName, String newName) {
Object value = jsonObject.get(oldName);
if (value !=null){
resObject.put(newName, value.toString());
}
}
public static void getPutDate(JSONObject resObject,JSONObject jsonObject,String oldName,String newName,SimpleDateFormat dataFormat) {
Object value = jsonObject.get(oldName);
if (value !=null){
Long valuelong= Long.parseLong(value.toString());
resObject.put(newName, dataFormat.format(valuelong).toString());
}
}
public static void getPutLong(JSONObject resObject,JSONObject jsonObject,String oldName,String newName) {
Object value = jsonObject.get(oldName);
if (value !=null){
Long valuelong= Long.parseLong(value.toString());
resObject.put(newName, valuelong);
}
}
}
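To make the field mapping concrete, here is a hypothetical example of what this interceptor does (the values are made up; fastjson does not guarantee key order, so the output order may differ):
Input event body:
{"body":{"body":{"id":"1001","name":"zhangsan","age":18,"time":1563422400000}}}
Output event body (newTime is formatted by SimpleDateFormat in the agent's local timezone, here assumed to be Asia/Shanghai):
{"newId":"1001","newName":"zhangsan","newAge":"18","newTime":"2019-07-18 12:00:00"}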
3 Build the jar and upload it to Flume's lib directory; on a CDH cluster that is /opt/cloudera/parcels/CDH/lib/flume-ng/lib/. Example commands below.
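For example (a standard Maven build; the jar name is a placeholder for whatever your own project produces):
mvn clean package
cp target/your-interceptor-1.0.jar /opt/cloudera/parcels/CDH/lib/flume-ng/lib/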
Part 2. Upload the Kudu sink jars
Target directory: /opt/cloudera/parcels/CDH/lib/flume-ng/lib/
kudu-client-1.7.0-cdh5.16.1.jar
kudu-flume-sink-1.7.0-cdh5.16.1.jar
If you are on CDH and Flume is already set up alongside Kudu, both jars ship with the distribution and live in Kudu's lib directory; there is nothing to recompile, simply copy them into Flume's lib directory (hedged copy commands below).
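For example, assuming the CDH parcel keeps its jars under /opt/cloudera/parcels/CDH/jars/ (verify the actual location on your cluster first):
cp /opt/cloudera/parcels/CDH/jars/kudu-client-1.7.0-cdh5.16.1.jar /opt/cloudera/parcels/CDH/lib/flume-ng/lib/
cp /opt/cloudera/parcels/CDH/jars/kudu-flume-sink-1.7.0-cdh5.16.1.jar /opt/cloudera/parcels/CDH/lib/flume-ng/lib/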
Part 3. Custom JSON parser (KuduOperationsProducer) for the Kudu sink. This is a generic package: build and upload it once, and every Kudu sink can reuse it.
1 Requirements: create it as a separate project, then package and upload that project on its own.
2 The pom is as follows:
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<maven.compiler.source>1.7</maven.compiler.source>
<maven.compiler.target>1.7</maven.compiler.target>
<scala.version>2.10.4</scala.version>
<flume.version>1.8.0</flume.version>
<kudu.version>1.7.1</kudu.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.kudu</groupId>
<artifactId>kudu-client</artifactId>
<version>${kudu.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flume</groupId>
<artifactId>flume-ng-core</artifactId>
<version>${flume.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.flume</groupId>
<artifactId>flume-ng-configuration</artifactId>
<version>${flume.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.avro</groupId>
<artifactId>avro</artifactId>
<version>1.8.1</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.7.12</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.2.47</version>
</dependency>
<dependency>
<groupId>org.apache.kudu</groupId>
<artifactId>kudu-flume-sink</artifactId>
<version>1.7.0</version>
</dependency>
</dependencies>
3 The code is as follows. There is no need to dig into how it works; it is a generic package, so you can simply build and upload it as-is.
package com.iflytek;
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import com.google.common.base.Preconditions;
import com.google.common.collect.Lists;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.FlumeException;
import org.apache.flume.annotations.InterfaceAudience;
import org.apache.flume.annotations.InterfaceStability;
import org.apache.kudu.ColumnSchema;
import org.apache.kudu.Schema;
import org.apache.kudu.Type;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.Operation;
import org.apache.kudu.client.PartialRow;
import org.apache.kudu.flume.sink.KuduOperationsProducer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
@InterfaceAudience.Public
@InterfaceStability.Evolving
public class XyJsonKuduOperationsProducer implements KuduOperationsProducer {
private static final Logger logger = LoggerFactory.getLogger(XyJsonKuduOperationsProducer.class);
private static final String INSERT = "insert";
private static final String UPSERT = "upsert";
private static final List<String> validOperations = Lists.newArrayList(UPSERT, INSERT);
public static final String ENCODING_PROP = "encoding";
public static final String DEFAULT_ENCODING = "utf-8";
public static final String OPERATION_PROP = "operation";
public static final String DEFAULT_OPERATION = UPSERT;
public static final String SKIP_MISSING_COLUMN_PROP = "skipMissingColumn";
public static final boolean DEFAULT_SKIP_MISSING_COLUMN = false;
public static final String SKIP_BAD_COLUMN_VALUE_PROP = "skipBadColumnValue";
public static final boolean DEFAULT_SKIP_BAD_COLUMN_VALUE = false;
public static final String WARN_UNMATCHED_ROWS_PROP = "skipUnmatchedRows";
public static final boolean DEFAULT_WARN_UNMATCHED_ROWS = true;
private KuduTable table;
private Charset charset;
private String operation;
private boolean skipMissingColumn;
private boolean skipBadColumnValue;
private boolean warnUnmatchedRows;
public XyJsonKuduOperationsProducer() {
}
@Override
public void configure(Context context) {
String charsetName = context.getString(ENCODING_PROP, DEFAULT_ENCODING);
try {
charset = Charset.forName(charsetName);
} catch (IllegalArgumentException e) {
throw new FlumeException(
String.format("Invalid or unsupported charset %s", charsetName), e);
}
operation = context.getString(OPERATION_PROP, DEFAULT_OPERATION).toLowerCase();
Preconditions.checkArgument(
validOperations.contains(operation),
"Unrecognized operation '%s'",
operation);
skipMissingColumn = context.getBoolean(SKIP_MISSING_COLUMN_PROP,
DEFAULT_SKIP_MISSING_COLUMN);
skipBadColumnValue = context.getBoolean(SKIP_BAD_COLUMN_VALUE_PROP,
DEFAULT_SKIP_BAD_COLUMN_VALUE);
warnUnmatchedRows = context.getBoolean(WARN_UNMATCHED_ROWS_PROP,
DEFAULT_WARN_UNMATCHED_ROWS);
}
@Override
public void initialize(KuduTable table) {
this.table = table;
}
@Override
public List<Operation> getOperations(Event event) throws FlumeException {
String s = new String(event.getBody(), charset);
Map<String, Object> rawMap = jsonStr2Map(s);
logger.info("Sink payload: " + rawMap.toString());
Schema schema = table.getSchema();
List<Operation> ops = Lists.newArrayList();
if(s != null && !s.isEmpty()) {
Operation op;
switch (operation) {
case UPSERT:
op = table.newUpsert();
break;
case INSERT:
op = table.newInsert();
break;
default:
throw new FlumeException(
String.format("Unrecognized operation type '%s' in getOperations(): " +
"this should never happen!", operation));
}
PartialRow row = op.getRow();
for (ColumnSchema col : schema.getColumns()) {
// logger.info("Column:" + col.getName() + "----" + rawMap.get(col.getName()));
try {
Object mapValue = rawMap.get(col.getName());
if (mapValue != null) {
coerceAndSet(mapValue.toString(), col.getName(), col.getType(), row);
}
} catch (NumberFormatException e) {
String msg = String.format(
"Raw value '%s' couldn't be parsed to type %s for column '%s'",
s, col.getType(), col.getName());
logOrThrow(skipBadColumnValue, msg, e);
} catch (IllegalArgumentException e) {
String msg = String.format(
"Column '%s' has no matching group in '%s'",
col.getName(), s);
logOrThrow(skipMissingColumn, msg, e);
} catch (Exception e) {
throw new FlumeException("Failed to create Kudu operation", e);
}
}
ops.add(op);
}
return ops;
}
/**
* Coerces the string `rawVal` to the type `type` and sets the resulting
* value for column `colName` in `row`.
*
* @param rawVal the raw string column value
* @param colName the name of the column
* @param type the Kudu type to convert `rawVal` to
* @param row the row to set the value in
* @throws NumberFormatException if `rawVal` cannot be cast as `type`.
*/
private void coerceAndSet(String rawVal, String colName, Type type, PartialRow row)
throws NumberFormatException {
switch (type) {
case INT8:
row.addByte(colName, Byte.parseByte(rawVal));
break;
case INT16:
row.addShort(colName, Short.parseShort(rawVal));
break;
case INT32:
row.addInt(colName, Integer.parseInt(rawVal));
break;
case INT64:
row.addLong(colName, Long.parseLong(rawVal));
break;
case BINARY:
row.addBinary(colName, rawVal.getBytes(charset));
break;
case STRING:
row.addString(colName, rawVal==null?"":rawVal);
break;
case BOOL:
row.addBoolean(colName, Boolean.parseBoolean(rawVal));
break;
case FLOAT:
row.addFloat(colName, Float.parseFloat(rawVal));
break;
case DOUBLE:
row.addDouble(colName, Double.parseDouble(rawVal));
break;
case UNIXTIME_MICROS:
row.addLong(colName, Long.parseLong(rawVal));
break;
default:
logger.warn("got unknown type {} for column '{}'-- ignoring this column", type, colName);
}
}
private void logOrThrow(boolean log, String msg, Exception e)
throws FlumeException {
if (log) {
logger.warn(msg, e);
} else {
throw new FlumeException(msg, e);
}
}
@Override
public void close() {
}
public static Map<String, Object> jsonStr2Map(String jsonStr) {
Map<String, Object> resultMap = new HashMap<>();
JSONObject jsonObject = JSON.parseObject(jsonStr);
resultMap=JSON.parseObject(jsonObject.toString(), Map.class);
return resultMap;
}
}
4 Build the jar and upload it to /opt/cloudera/parcels/CDH/lib/flume-ng/lib/
Part 4. Configure the Flume agent
1 In the /opt/cloudera/parcels/CDH/lib/flume-ng/conf directory, run
vi kudu.conf
and add the following:
ng.sources = kafkaSource
ng.channels = memorychannel
ng.sinks = kudusink
ng.sources.kafkaSource.type= org.apache.flume.source.kafka.KafkaSource
ng.sources.kafkaSource.kafka.bootstrap.servers=cdh01:9092,cdh02:9092,cdh03:9092
ng.sources.kafkaSource.kafka.consumer.group.id=xytest1
ng.sources.kafkaSource.kafka.topics=pd_ry_txjl
ng.sources.kafkaSource.batchSize=1000
ng.sources.kafkaSource.channels= memorychannel
ng.sources.kafkaSource.kafka.consumer.auto.offset.reset=latest
ng.sources.kafkaSource.interceptors= i1
ng.sources.kafkaSource.interceptors.i1.type=com.iflytek.extracting.flume.interceptor.XyJsonInterceptorTC$Builder
# Custom interceptor class. To feed other Kudu tables, the Kudu-related settings below stay the same; just write a new interceptor and change the Kudu table name.
ng.channels.memorychannel.type = memory
ng.channels.memorychannel.keep-alive = 3
ng.channels.memorychannel.byteCapacityBufferPercentage = 20
ng.channels.memorychannel.transactionCapacity = 10000
ng.channels.memorychannel.capacity = 100000
ng.sinks.kudusink.type = org.apache.kudu.flume.sink.KuduSink
# Note: Flume reads this file as Java properties, so trailing inline comments would become part of the value; comments must sit on their own lines.
# Kudu master address
ng.sinks.kudusink.masterAddresses = cdh01:7051
# Kudu table name (use the Kudu table name directly, not the Impala table name)
ng.sinks.kudusink.tableName = rytx_kudu
ng.sinks.kudusink.operation = insert
ng.sinks.kudusink.batchSize = 50
# Custom JSON operations producer class
ng.sinks.kudusink.producer = com.iflytek.XyJsonKuduOperationsProducer
ng.sinks.kudusink.channel = memorychannel
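The Kudu table named above must already exist, and its column names must match the fields the interceptor emits (newId, newName, newAge, newTime). Below is a minimal sketch that creates such a table with the Kudu Java client; the column types, the hash partitioning on newId, and the single replica are assumptions for a test setup, not requirements of the sink:
import java.util.Arrays;
import java.util.Collections;
import org.apache.kudu.ColumnSchema;
import org.apache.kudu.Schema;
import org.apache.kudu.Type;
import org.apache.kudu.client.CreateTableOptions;
import org.apache.kudu.client.KuduClient;

public class CreateRytxKuduTable {
    public static void main(String[] args) throws Exception {
        KuduClient client = new KuduClient.KuduClientBuilder("cdh01:7051").build();
        // All columns are STRING here because the interceptor emits string values;
        // numeric types would also work, since XyJsonKuduOperationsProducer parses the
        // string value into whatever type the column declares.
        Schema schema = new Schema(Arrays.asList(
                new ColumnSchema.ColumnSchemaBuilder("newId", Type.STRING).key(true).build(),
                new ColumnSchema.ColumnSchemaBuilder("newName", Type.STRING).nullable(true).build(),
                new ColumnSchema.ColumnSchemaBuilder("newAge", Type.STRING).nullable(true).build(),
                new ColumnSchema.ColumnSchemaBuilder("newTime", Type.STRING).nullable(true).build()));
        CreateTableOptions options = new CreateTableOptions()
                .addHashPartitions(Collections.singletonList("newId"), 4) // 4 hash buckets, an arbitrary choice
                .setNumReplicas(1);                                       // single replica, fine for a test cluster
        if (!client.tableExists("rytx_kudu")) {
            client.createTable("rytx_kudu", schema, options);
        }
        client.shutdown();
    }
}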
2 Start Flume:
bin/flume-ng agent -n ng -c conf -f conf/kudu.conf
On CDH, Flume writes its log to /var/log/flume/flume.log by default.
Check that the data is landing in Kudu; once everything looks good, run the agent in the background:
nohup bin/flume-ng agent -n ng -c conf -f conf/kudu.conf &
To stop the agent:
jcmd | grep kudu.conf    # find the JVM whose command line contains kudu.conf
then kill that process id.
References:
https://kudu.apache.org/2016/08/31/intro-flume-kudu-sink.html
https://cloud.tencent.com/developer/article/1158194