The Kudu Java API provides three flush modes for writing data to Kudu:
1. AUTO_FLUSH_SYNC
2. AUTO_FLUSH_BACKGROUND
3. MANUAL_FLUSH
As the source code shows:
public interface SessionConfiguration {

  @InterfaceAudience.Public
  @InterfaceStability.Evolving
  enum FlushMode {
    /**
     * Each {@link KuduSession#apply KuduSession.apply()} call will return only after being
     * flushed to the server automatically. No batching will occur.
     *
     * <p>In this mode, the {@link KuduSession#flush} call never has any effect, since each
     * {@link KuduSession#apply KuduSession.apply()} has already flushed the buffer before
     * returning.
     *
     * <p><strong>This is the default flush mode.</strong>
     */
    AUTO_FLUSH_SYNC,

    /**
     * {@link KuduSession#apply KuduSession.apply()} calls will return immediately, but the writes
     * will be sent in the background, potentially batched together with other writes from
     * the same session. If there is not sufficient buffer space, then
     * {@link KuduSession#apply KuduSession.apply()} may block for buffer space to be available.
     *
     * <p>Because writes are applied in the background, any errors will be stored
     * in a session-local buffer. Call {@link #countPendingErrors() countPendingErrors()} or
     * {@link #getPendingErrors() getPendingErrors()} to retrieve them.
     *
     * <p><strong>Note:</strong> The {@code AUTO_FLUSH_BACKGROUND} mode may result in
     * out-of-order writes to Kudu. This is because in this mode multiple write
     * operations may be sent to the server in parallel.
     * See <a href="https://issues.apache.org/jira/browse/KUDU-1767">KUDU-1767</a> for more
     * information.
     *
     * <p>The {@link KuduSession#flush()} call can be used to block until the buffer is empty.
     */
    AUTO_FLUSH_BACKGROUND,

    /**
     * {@link KuduSession#apply KuduSession.apply()} calls will return immediately, but the writes
     * will not be sent until the user calls {@link KuduSession#flush()}. If the buffer runs past
     * the configured space limit, then {@link KuduSession#apply KuduSession.apply()} will return
     * an error.
     */
    MANUAL_FLUSH
  }
  // ...
}
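A brief explanation of what the three flush modes mean:
1. AUTO_FLUSH_SYNC (the default): each KuduSession.apply() call returns only after the data has been flushed to the server, so no batching takes place. Calling KuduSession.flush() has no effect in this mode, because the buffer has already been flushed by the time apply() returns.
2. AUTO_FLUSH_BACKGROUND: KuduSession.apply() returns immediately and the write is sent in the background, possibly batched with other writes from the same session. If there is not enough buffer space, KuduSession.apply() may block until space becomes available. Because writes are applied in the background, any errors are stored in a session-local buffer, which can be read with countPendingErrors() or getPendingErrors(). Note: this mode may result in out-of-order writes, because several write operations can be sent to the server in parallel. This is a known Kudu issue, tracked as KUDU-1767. KuduSession.flush() can be called to block until the buffer is empty.
3. MANUAL_FLUSH: KuduSession.apply() returns immediately, but nothing is sent until the user calls KuduSession.flush(). If the buffer grows past the configured space limit, KuduSession.apply() returns an error.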
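To make the difference between the first and third modes concrete, here is a minimal toy model in plain Java. It is not the Kudu API: `ToySession`, its fields, and the `IllegalStateException` it throws are all hypothetical stand-ins, and AUTO_FLUSH_BACKGROUND is left out because its timing is not deterministic. The sketch only illustrates when a row reaches the "server" relative to apply() and flush().

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a write session. NOT the Kudu API; every name here is hypothetical.
class ToySession {
    enum Mode { AUTO_FLUSH_SYNC, MANUAL_FLUSH }

    private final Mode mode;
    private final int bufferLimit;                          // max rows held on the client
    private final List<String> buffer = new ArrayList<>();  // rows not yet sent
    final List<String> server = new ArrayList<>();          // rows the "server" has received

    ToySession(Mode mode, int bufferLimit) {
        this.mode = mode;
        this.bufferLimit = bufferLimit;
    }

    // Models the documented behaviour of apply() in the two deterministic modes.
    void apply(String row) {
        if (mode == Mode.AUTO_FLUSH_SYNC) {
            server.add(row);   // sent before apply() returns, so no batching
            return;
        }
        if (buffer.size() >= bufferLimit) {
            // Stand-in for the error the real client reports when the buffer is full.
            throw new IllegalStateException("buffer full: call flush() first");
        }
        buffer.add(row);       // held on the client until flush()
    }

    // Sends everything buffered so far. In AUTO_FLUSH_SYNC the buffer is always empty,
    // so this has no effect.
    void flush() {
        server.addAll(buffer);
        buffer.clear();
    }

    // Number of rows that would be lost if the client died right now.
    int pending() {
        return buffer.size();
    }
}
```

In this model, two apply() calls in MANUAL_FLUSH leave `server` empty and `pending()` at 2 until flush() is called, whereas in AUTO_FLUSH_SYNC every row is on the server as soon as apply() returns.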
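In my experience, the third mode (MANUAL_FLUSH) is the most efficient of the three, but it still has a problem:
When I use Flink for real-time processing, data gets lost.
For example, I use Flink to consume data from Kafka and insert it into Kudu in real time with MANUAL_FLUSH, calling session.flush() to send the buffered data to the Kudu server once 10 records have accumulated. Suppose the client has buffered 9 records, one short of the threshold, when a power failure or some other incident stops the job. Those 9 records were never flushed, and when I restart the job (the Flink program has checkpointing enabled) they are still not inserted into the database. I have not yet found a way to handle this, and I hope we can work it out together.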
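One likely explanation is that a checkpoint records the Kafka offset of records that are still sitting in the client-side buffer, so after a restart the source resumes past them. A commonly suggested remedy is to also flush whenever a checkpoint is taken, for example from the `snapshotState` method of Flink's `CheckpointedFunction`, so that a checkpoint never covers unflushed rows. The sketch below is a toy replay model in plain Java, not Flink or Kudu code; `CheckpointDemo`, `run`, and its parameters are hypothetical names used only to illustrate the idea.

```java
import java.util.ArrayList;
import java.util.List;

// Toy replay model. NOT Flink or Kudu code; every name here is hypothetical.
class CheckpointDemo {
    static final int BATCH = 10;  // flush threshold, as in the scenario above

    // Processes `input` from offset 0, takes a checkpoint after `checkpointAt` records,
    // crashes right after it, restarts from the checkpointed offset, and returns the
    // rows that ended up in the store.
    static List<Integer> run(List<Integer> input, int checkpointAt, boolean flushOnCheckpoint) {
        List<Integer> store = new ArrayList<>();   // rows that reached the "server"
        List<Integer> buffer = new ArrayList<>();  // client-side buffer, lost on a crash

        // First attempt: runs up to the checkpoint, then the process dies.
        for (int offset = 0; offset < checkpointAt; offset++) {
            buffer.add(input.get(offset));
            if (buffer.size() >= BATCH) { store.addAll(buffer); buffer.clear(); }
        }
        if (flushOnCheckpoint) { store.addAll(buffer); buffer.clear(); }
        int committedOffset = checkpointAt;        // source offset saved by the checkpoint
        buffer.clear();                            // crash: unflushed rows vanish

        // Restart: the source resumes from the committed offset.
        for (int offset = committedOffset; offset < input.size(); offset++) {
            buffer.add(input.get(offset));
            if (buffer.size() >= BATCH) { store.addAll(buffer); buffer.clear(); }
        }
        store.addAll(buffer);                      // a clean shutdown flushes the tail
        return store;
    }
}
```

With 25 input records and a checkpoint after the ninth, `run(input, 9, false)` ends up without the first nine records, while `run(input, 9, true)` keeps all of them. In a real job this only gives at-least-once delivery: records written after the last checkpoint are replayed on restart, so writes would need to tolerate duplicates, for instance by using upserts on the primary key.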
Here is the KuduSink code I wrote at the time:
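import org.apache.flink.api.common.accumulators.IntCounter
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction
import org.apache.kudu.client.{Insert, KuduClient, KuduSession, KuduTable, SessionConfiguration}
import org.slf4j.LoggerFactory

object KuduSink extends RichSinkFunction[(String, String, String, String, String, String, String, String)] {
  private val logger = LoggerFactory.getLogger(KuduSink.getClass)
  var clint: KuduClient = null
  var session: KuduSession = null
  var table: KuduTable = null
  // Counter of buffered operations, used to decide when a batch is full
  var OPERATION_BATCH: IntCounter = null
  private lazy val PATH = PATH_Qqwry_dat.stringValue

  /**
   * Business logic
   *
   * @param value a tuple of (preEvent, preIp, preOs, preOsVersion, preLib, preBrowser, preBrowserVersion, preProject)
   */
  override def invoke(value: (String, String, String, String, String, String, String, String)): Unit = {}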
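
  def insertDB(insert: Insert): Unit = {
    // Hand the insert to the session; in MANUAL_FLUSH mode it is only buffered here
    session.apply(insert)
    this.OPERATION_BATCH.add(1)
    val num = this.OPERATION_BATCH.getLocalValue
    // Once 10 operations are buffered, send the batch to the Kudu server
    if (num > 9) {
      val responses = session.flush()
      this.OPERATION_BATCH.resetLocal()
      // Make sure the rows were written: flush() reports per-row failures in its responses
      val it = responses.iterator()
      while (it.hasNext) {
        val resp = it.next()
        if (resp.hasRowError) {
          logger.error("Kudu write failed: " + resp.getRowError)
        }
      }
    }
  }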

  /**
   * Open the Kudu connection
   *
   * @param parameters
   */
  override def open(parameters: Configuration): Unit = {
    clint = new KuduClient.KuduClientBuilder(KUDU_MASTER.stringValue).build()
    session = clint.newSession()
    table = clint.openTable(KUDU_TABLE_MK_DataDictionary.stringValue)
    val mode = SessionConfiguration.FlushMode.MANUAL_FLUSH
    session.setFlushMode(mode)
    OPERATION_BATCH = new IntCounter()
    //getRuntimeContext().addAccumulator("operationBatch",this.OPERATION_BATCH)
  }

  /**
   * Close the Kudu connection
   */
  override def close(): Unit = {
    if (session != null) {
      session.close()
    }
    if (clint != null) {
      clint.close()
    }
  }
}