SeaTunnel 2.1.2 Source Code Analysis (3): seatunnel-connectors-flink-es/kafka
This article takes part in the "WeOpen Star" program; readers are welcome to join. Program link: https://github.com/weopenprojects/WeOpen-Star
SeaTunnel is an easy-to-use data integration framework. Within an enterprise, systems are often built at different times or by different departments, so multiple heterogeneous information systems end up running side by side on different software and hardware platforms. Data integration consolidates data of different origins, formats, and characteristics, logically or physically, to give the enterprise a comprehensive, shared view of its data.
Note: the source code analysis below is offered for reference; if you spot a mistake, corrections are welcome!
I. seatunnel-connectors-flink-elasticsearch6/7
1. Overview
2. Source Code Analysis
Part of the ES7 ElasticsearchOutputFormat source is shown below:
@Override
public void configure(Configuration configuration) {
    List<String> hosts = config.getStringList(HOSTS);
    Settings.Builder settings = Settings.builder();
    config.entrySet().forEach(entry -> {
        String key = entry.getKey();
        Object value = entry.getValue().unwrapped();
        if (key.startsWith(PREFIX)) {
            settings.put(key.substring(PREFIX.length()), value.toString());
        }
    });
    TransportClient transportClient = new PreBuiltTransportClient(settings.build());
    // hosts: the list of Elasticsearch cluster hosts to connect to
    for (String host : hosts) {
        try {
            transportClient.addTransportAddresses(new TransportAddress(InetAddress.getByName(host.split(":")[0]), Integer.parseInt(host.split(":")[1])));
        } catch (Exception e) {
            LOGGER.warn("Host '{}' parse failed.", host, e);
        }
    }
    // Bulk processing
    BulkProcessor.Builder bulkProcessorBuilder = BulkProcessor.builder(transportClient, new BulkProcessor.Listener() {
        @Override
        public void beforeBulk(long executionId, BulkRequest request) {
        }

        @Override
        public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
        }

        @Override
        public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
        }
    });
    // Number of concurrent requests; the default is 1, i.e. one bulk request may run
    // concurrently, so accumulating bulk requests and sending them is asynchronous
    // bulkProcessorBuilder.setConcurrentRequests(0);
    bulkProcessor = bulkProcessorBuilder.build();
    requestIndexer = new RequestIndexer() {
        @Override
        public void add(DeleteRequest... deleteRequests) {
        }

        @Override
        public void add(IndexRequest... indexRequests) {
            for (IndexRequest indexRequest : indexRequests) {
                // Hand each request to the BulkProcessor, which batches and sends them to the cluster
                bulkProcessor.add(indexRequest);
            }
        }

        @Override
        public void add(UpdateRequest... updateRequests) {
        }
    };
}

@Override
public void writeRecord(T t) {
    // elasticsearchSinkFunction: the user-supplied logic that turns a record
    // into requests to be sent to Elasticsearch
    elasticsearchSinkFunction.process(t, getRuntimeContext(), requestIndexer);
}
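The listener callbacks in the excerpt above are left empty. For reference, here is a minimal sketch of what a failure-logging BulkProcessor.Listener could look like; this is illustrative only, not SeaTunnel's code, and the class name is made up:
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative sketch only, not SeaTunnel's code: a listener that logs bulk failures
class LoggingBulkListener implements BulkProcessor.Listener {
    private static final Logger LOGGER = LoggerFactory.getLogger(LoggingBulkListener.class);

    @Override
    public void beforeBulk(long executionId, BulkRequest request) {
        LOGGER.debug("Sending bulk {} with {} actions", executionId, request.numberOfActions());
    }

    @Override
    public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
        // Individual actions can fail even when the bulk call itself succeeds
        if (response.hasFailures()) {
            LOGGER.warn("Bulk {} had failures: {}", executionId, response.buildFailureMessage());
        }
    }

    @Override
    public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
        // The whole bulk request failed, e.g. the cluster was unreachable
        LOGGER.error("Bulk {} failed", executionId, failure);
    }
}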
Part of the ES7 Elasticsearch sink source is shown below:
// Streaming output
@Override
public void outputStream(FlinkEnvironment env, DataStream<Row> dataStream) {
    List<HttpHost> httpHosts = new ArrayList<>();
    List<String> hosts = config.getStringList(HOSTS);
    for (String host : hosts) {
        // httpHosts: the list of Elasticsearch cluster hosts to connect to
        httpHosts.add(new HttpHost(host.split(":")[0], Integer.parseInt(host.split(":")[1]), "http"));
    }
    RowTypeInfo rowTypeInfo = (RowTypeInfo) dataStream.getType();
    indexName = StringTemplate.substitute(config.getString(INDEX), config.getString(INDEX_TIME_FORMAT));
    // Flink ships an official sink connector for Elasticsearch: ElasticsearchSink.
    // Its constructor is private, so it must be created via the static inner Builder class.
    ElasticsearchSink.Builder<Row> esSinkBuilder = new ElasticsearchSink.Builder<>(
            httpHosts, (ElasticsearchSinkFunction<Row>) (element, ctx, indexer) ->
            indexer.add(createIndexRequest(rowTypeInfo.getFieldNames(), element))
    );
    // Flush a bulk request for every single record (i.e. near real time)
    esSinkBuilder.setBulkFlushMaxActions(1);
    // finally, build and add the sink to the job's pipeline
    if (config.hasPath(PARALLELISM)) {
        int parallelism = config.getInt(PARALLELISM);
        // Only the Builder's build() method can produce the ElasticsearchSink SinkFunction
        dataStream.addSink(esSinkBuilder.build()).setParallelism(parallelism);
    } else {
        dataStream.addSink(esSinkBuilder.build());
    }
}
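The createIndexRequest helper called above is not part of this excerpt. As a rough, hypothetical sketch (the class name and parameter list below are placeholders; the real SeaTunnel implementation may differ), it presumably zips the Row's field names with its column values into a source map:
import java.util.HashMap;
import java.util.Map;
import org.apache.flink.types.Row;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.Requests;

// Hypothetical sketch of a createIndexRequest helper, not SeaTunnel's actual code;
// in the real class, indexName is an instance field rather than a parameter.
public class IndexRequestSketch {
    public static IndexRequest createIndexRequest(String[] fieldNames, Row element, String indexName) {
        Map<String, Object> json = new HashMap<>();
        for (int i = 0; i < fieldNames.length; i++) {
            // Pair each field name with the corresponding column value of the Row
            json.put(fieldNames[i], element.getField(i));
        }
        return Requests.indexRequest()
                .index(indexName)
                .source(json);
    }
}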
// Batch output
@Override
public void outputBatch(FlinkEnvironment env, DataSet<Row> dataSet) {
    RowTypeInfo rowTypeInfo = (RowTypeInfo) dataSet.getType();
    indexName = StringTemplate.substitute(config.getString(INDEX), config.getString(INDEX_TIME_FORMAT));
    // The DataSet batch API sees little real-world use nowadays; Flink 1.12.0 fully
    // unified stream and batch processing, and DataSet has been soft-deprecated since then
    DataSink<Row> dataSink = dataSet.output(new ElasticsearchOutputFormat<>(config,
            (ElasticsearchSinkFunction<Row>) (element, ctx, indexer) ->
                    indexer.add(createIndexRequest(rowTypeInfo.getFieldNames(), element))));
    if (config.hasPath(PARALLELISM)) {
        int parallelism = config.getInt(PARALLELISM);
        dataSink.setParallelism(parallelism);
    }
}
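Since the comment above mentions the soft deprecation of DataSet, here is a minimal sketch of its replacement: running a DataStream pipeline in batch execution mode. This is plain Flink 1.12+ usage, not SeaTunnel's own code:
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BatchModeSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Run the DataStream job with batch scheduling and blocking shuffles,
        // the unified replacement for the soft-deprecated DataSet API
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);
        env.fromElements("a", "b", "c").print();
        env.execute("batch-mode sketch");
    }
}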
II. seatunnel-connectors-flink-kafka
1. Overview
2. Source Code Analysis
Part of the KafkaSink source is shown below:
// Resolve the delivery semantic from the config string
private FlinkKafkaProducer.Semantic getSemanticEnum(String semantic) {
    if ("exactly_once".equals(semantic)) {
        // Exactly-once semantics.
        // Note: for true end-to-end exactly-once delivery, pass Semantic.EXACTLY_ONCE
        // to the FlinkKafkaProducer constructor
        return FlinkKafkaProducer.Semantic.EXACTLY_ONCE;
    } else if ("at_least_once".equals(semantic)) {
        // At-least-once semantics
        return FlinkKafkaProducer.Semantic.AT_LEAST_ONCE;
    } else {
        return FlinkKafkaProducer.Semantic.NONE;
    }
}
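To show where this enum typically ends up, here is a minimal sketch of constructing a FlinkKafkaProducer with exactly-once semantics. This is generic Flink Kafka connector usage rather than SeaTunnel's code; the topic name and broker address are placeholders:
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaSinkSketch {
    public static FlinkKafkaProducer<String> buildProducer() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder broker
        // EXACTLY_ONCE uses Kafka transactions; the transaction timeout must not
        // exceed the broker's transaction.max.timeout.ms (15 minutes by default)
        props.setProperty("transaction.timeout.ms", "600000");
        KafkaSerializationSchema<String> schema = (element, timestamp) ->
                new ProducerRecord<>("demo_topic", element.getBytes(StandardCharsets.UTF_8));
        return new FlinkKafkaProducer<>(
                "demo_topic",          // default topic (placeholder)
                schema,
                props,
                FlinkKafkaProducer.Semantic.EXACTLY_ONCE);
    }
}
Exactly-once delivery also requires checkpointing to be enabled on the job, since the producer commits its Kafka transactions on checkpoint completion.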