Official documentation links
- Flink official documentation: https://ci.apache.org/projects/flink/flink-docs-release-1.11/
- Flink Elasticsearch Connector documentation: https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/connectors/elasticsearch.html
- Elasticsearch scripted upsert documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html#upserts
Problem description
On the sink side, old data can overwrite new data. For example, record A has two versions, A1 and A2, where 1 and 2 can be read as timestamps. When writing to Elasticsearch we cannot guarantee the order in which A1 and A2 arrive: if the write order is A1 then A2, Elasticsearch ends up holding the latest data; but if the order is A2 then A1, the data finally stored in Elasticsearch is stale. How do we solve this? By using Elasticsearch's upsert + script feature.
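To make the mechanism concrete before wiring it into Flink, here is a minimal standalone sketch of a scripted upsert (assuming a local Elasticsearch 7.x at 127.0.0.1:9200; the index name demo and the ts field are made up for illustration). The script overwrites the document only when the stored timestamp is older than the incoming one; on first insert the upsert document is created as-is:

import org.apache.http.HttpHost;
import org.elasticsearch.action.update.UpdateRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.script.Script;
import org.elasticsearch.script.ScriptType;
import java.util.HashMap;
import java.util.Map;

public class ScriptedUpsertSketch {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("127.0.0.1", 9200, "http")))) {
            // Incoming version of record "A": ts = 2
            Map<String, Object> doc = new HashMap<>();
            doc.put("ts", 2L);
            doc.put("value", "A2");
            // Overwrite only when the stored ts is older than the incoming ts.
            // On first insert, ctx._source is the upsert document below, so the
            // condition is false and the document is simply created.
            String painless =
                    "if (ctx._source.ts < params.ts) {" +
                    "  ctx._source.ts = params.ts;" +
                    "  ctx._source.value = params.value;" +
                    "}";
            UpdateRequest request = new UpdateRequest("demo", "A")
                    .scriptedUpsert(true)
                    .script(new Script(ScriptType.INLINE, "painless", painless, doc))
                    .upsert(doc);
            client.update(request, RequestOptions.DEFAULT);
        }
    }
}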
Demo code
The following example uses Flink 1.11.1 to update Elasticsearch documents based on a timestamp carried in the data. Suppose we have a simple data stream whose records contain a name, an event date, and a few other fields. We want to write these records into Elasticsearch, using the timestamp to decide whether an already-stored document should be updated.
First, we need the Flink Elasticsearch connector and the Elasticsearch Java client library (the connector pulls the client in transitively). The following Maven dependency can be used:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-elasticsearch7_2.11</artifactId>
    <version>1.11.1</version>
</dependency>
Then, we create an Elasticsearch sink and wire it into the Flink program:
import org.apache.flink.api.common.functions.RuntimeContext;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkFunction;
import org.apache.flink.streaming.connectors.elasticsearch.RequestIndexer;
import org.apache.flink.streaming.connectors.elasticsearch7.ElasticsearchSink;
import org.apache.http.HttpHost;
import org.elasticsearch.action.update.UpdateRequest;
import org.elasticsearch.script.Script;
import org.elasticsearch.script.ScriptType;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.util.*;
/**
 * @className: ESSinkTest
 * @Description: Flink sink demo: Elasticsearch scripted upsert that keeps only the newest version of each document
 * @Author: wangyifei
 * @Date: 2023/7/1 12:00
 */
public class ESSinkTest {
    private static final Logger logger = LoggerFactory.getLogger(ESSinkTest.class);

    // Painless script template. The __name__/__desc__/__date__/__price__ placeholders
    // are replaced with the incoming record's values before the request is built.
    // The script converts both the incoming date and the stored date to epoch millis
    // (UTC+8) and overwrites the document only when the stored date is older.
    private static final String script =
            "String pattern = \"yyyy-MM-dd HH:mm:ss\";\n" +
            "DateTimeFormatter formatter = DateTimeFormatter.ofPattern(pattern);\n" +
            "LocalDateTime parse1 = LocalDateTime.parse(\"__date__ 00:00:00\", formatter);\n" +
            "long l = parse1.toInstant(ZoneOffset.ofHours(8)).toEpochMilli();\n" +
            "LocalDateTime parse2 = LocalDateTime.parse(ctx._source.date + \" 00:00:00\", formatter);\n" +
            "long ll = parse2.toInstant(ZoneOffset.ofHours(8)).toEpochMilli();\n" +
            "if (ll < l) {\n" +
            "    ctx._source.name = \"__name__\";\n" +
            "    ctx._source.desc = \"__desc__\";\n" +
            "    ctx._source.date = \"__date__\";\n" +
            "    ctx._source.price = __price__;\n" +
            "}\n";
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
        // Demo source: comma-separated records "id,name,desc,date,price" read from a local socket
        DataStreamSource<String> source = env.socketTextStream("127.0.0.1", 999);
        List<HttpHost> https = new ArrayList<>();
        https.add(new HttpHost("127.0.0.1", 9200, "http"));
ElasticsearchSink.Builder<String> builder
= new ElasticsearchSink.Builder<String>(
https
, new ElasticsearchSinkFunction<String>() {
@Override
            public void process(String cnt, RuntimeContext runtimeContext, RequestIndexer requestIndexer) {
                // Record format: id,name,desc,date,price
                String[] split = cnt.split(",");
                String id = split[0];
                String name = split[1];
                String desc = split[2];
                String date = split[3];
                String price = split[4];
                UpdateRequest request = new UpdateRequest();
                // Substitute this record's values into the script template
                String localScript = script.replace("__name__", name);
                localScript = localScript.replace("__desc__", desc);
                localScript = localScript.replace("__date__", date);
                localScript = localScript.replace("__price__", price);
                // scriptedUpsert(true): if the document does not exist, the upsert
                // document below becomes ctx._source and the script still runs;
                // if it does exist, the script decides whether to apply the update.
                request.index("product")
                        .scriptedUpsert(true)
                        .id(id)
                        .script(new Script(ScriptType.INLINE, "painless", localScript, Collections.emptyMap()));
                Map<String, Object> bean = new HashMap<>();
                bean.put("name", name);
                bean.put("desc", desc);
                bean.put("date", date);
                bean.put("price", price);
                logger.info("upsert doc: {}", bean);
                // Initial document for the insert case
                request.upsert(bean);
                requestIndexer.add(request);
}
}
);
        // Flush after every record so the demo updates Elasticsearch immediately
        builder.setBulkFlushMaxActions(1);
        ElasticsearchSink<String> sink = builder.build();
        source.addSink(sink);
env.execute();
}
}
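To try the demo, start a socket server on port 999 (e.g. with nc; note that ports below 1024 may require elevated privileges), send a newer record such as 1,apple,v2,2023-07-02,12 followed by an older one such as 1,apple,v1,2023-07-01,10, and then confirm the document still holds the newer version. A minimal check, assuming the same local cluster (the class name VerifyProduct is just for illustration):

import org.apache.http.HttpHost;
import org.elasticsearch.action.get.GetRequest;
import org.elasticsearch.action.get.GetResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

public class VerifyProduct {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("127.0.0.1", 9200, "http")))) {
            // Expect the 2023-07-02 version even though the older record arrived last
            GetResponse response = client.get(new GetRequest("product", "1"), RequestOptions.DEFAULT);
            System.out.println(response.getSourceAsString());
        }
    }
}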
In the code above, we read records from a simple socket source and write them to Elasticsearch with an ElasticsearchSink. The script + upsert combination makes the sink idempotent and ensures that an older version of a record never overwrites a newer one.
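One design note: because the demo splices concrete values into the script source with String.replace, every record produces a distinct script, and Elasticsearch compiles and caches each unique inline script, which can run into the script compilation rate limit under load. A common alternative is to keep a single constant script and pass the values through script params. A sketch of the changed part (not in the original demo; the helper class and method names are made up for illustration):

import org.elasticsearch.action.update.UpdateRequest;
import org.elasticsearch.script.Script;
import org.elasticsearch.script.ScriptType;
import java.util.HashMap;
import java.util.Map;

public class ParamScriptFactory {
    // A single constant script source: Elasticsearch compiles it once and reuses it.
    private static final String SHARED_SCRIPT =
            "String pattern = 'yyyy-MM-dd HH:mm:ss';" +
            "DateTimeFormatter formatter = DateTimeFormatter.ofPattern(pattern);" +
            "long incoming = LocalDateTime.parse(params.date + ' 00:00:00', formatter)" +
            ".toInstant(ZoneOffset.ofHours(8)).toEpochMilli();" +
            "long stored = LocalDateTime.parse(ctx._source.date + ' 00:00:00', formatter)" +
            ".toInstant(ZoneOffset.ofHours(8)).toEpochMilli();" +
            "if (stored < incoming) {" +
            "  ctx._source.name = params.name;" +
            "  ctx._source.desc = params.desc;" +
            "  ctx._source.date = params.date;" +
            "  ctx._source.price = params.price;" +
            "}";

    static UpdateRequest buildRequest(String id, String name, String desc, String date, String price) {
        Map<String, Object> params = new HashMap<>();
        params.put("name", name);
        params.put("desc", desc);
        params.put("date", date);
        // price stays a string here, matching the upsert document's type
        params.put("price", price);
        UpdateRequest request = new UpdateRequest("product", id).scriptedUpsert(true);
        request.script(new Script(ScriptType.INLINE, "painless", SHARED_SCRIPT, params));
        request.upsert(new HashMap<>(params)); // initial document for the insert case
        return request;
    }
}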