Three Ways to Insert Data into Kudu

Seniscz

Category: kudu

Article link: https://blog.csdn.net/CZ_yjsy_data/article/details/88390696

The Kudu Java API provides three flush modes that control how inserted data is sent to Kudu:

1. AUTO_FLUSH_SYNC

2. AUTO_FLUSH_BACKGROUND

3. MANUAL_FLUSH

As the source code shows:

public interface SessionConfiguration {

  @InterfaceAudience.Public
  @InterfaceStability.Evolving
  enum FlushMode {
    /**
     * Each {@link KuduSession#apply KuduSession.apply()} call will return only after being
     * flushed to the server automatically. No batching will occur.
     *
     * <p>In this mode, the {@link KuduSession#flush} call never has any effect, since each
     * {@link KuduSession#apply KuduSession.apply()} has already flushed the buffer before
     * returning.
     *
     * <p><strong>This is the default flush mode.</strong>
     */
    AUTO_FLUSH_SYNC,

    /**
     * {@link KuduSession#apply KuduSession.apply()} calls will return immediately, but the writes
     * will be sent in the background, potentially batched together with other writes from
     * the same session. If there is not sufficient buffer space, then
     * {@link KuduSession#apply KuduSession.apply()} may block for buffer space to be available.
     *
     * <p>Because writes are applied in the background, any errors will be stored
     * in a session-local buffer. Call {@link #countPendingErrors() countPendingErrors()} or
     * {@link #getPendingErrors() getPendingErrors()} to retrieve them.
     *
     * <p><strong>Note:</strong> The {@code AUTO_FLUSH_BACKGROUND} mode may result in
     * out-of-order writes to Kudu. This is because in this mode multiple write
     * operations may be sent to the server in parallel.
     * See <a href="https://issues.apache.org/jira/browse/KUDU-1767">KUDU-1767</a> for more
     * information.
     *
     * <p>The {@link KuduSession#flush()} call can be used to block until the buffer is empty.
     */
    AUTO_FLUSH_BACKGROUND,

    /**
     * {@link KuduSession#apply KuduSession.apply()} calls will return immediately, but the writes
     * will not be sent until the user calls {@link KuduSession#flush()}. If the buffer runs past
     * the configured space limit, then {@link KuduSession#apply KuduSession.apply()} will return
     * an error.
     */
    MANUAL_FLUSH
  }

  // ... remaining SessionConfiguration methods omitted ...
}

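Briefly, the three flush modes mean the following:

1. AUTO_FLUSH_SYNC (the default): each KuduSession.apply() call returns only after the write has been flushed to the server. No batching takes place in this mode, and calling KuduSession.flush() has no effect, because the buffer has already been flushed by the time apply() returns.

2. AUTO_FLUSH_BACKGROUND: KuduSession.apply() returns immediately, and the write is sent in the background, possibly batched with other writes from the same session. If there is not enough buffer space, apply() may block until space becomes available. Because writes are applied in the background, any errors are stored in a session-local buffer; call countPendingErrors() or getPendingErrors() to retrieve them. Note: this mode may result in out-of-order writes, because multiple write operations can be sent to the server in parallel. This is a known Kudu issue, tracked as KUDU-1767.

3. MANUAL_FLUSH: KuduSession.apply() returns immediately, but the writes are not sent until the user calls KuduSession.flush(). If the buffer grows past the configured space limit, apply() returns an error.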
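To make the difference between the three modes concrete, here is a minimal in-memory sketch. It is a toy model and does not use the Kudu client: the class FlushModeModel, its bufferLimit parameter (counted in rows), and the backgroundFlush() method are hypothetical names introduced only for illustration, and the "server" is just a list. Blocking in AUTO_FLUSH_BACKGROUND is approximated by draining the buffer synchronously.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy in-memory model of the three flush modes; it does NOT use the Kudu client. */
final class FlushModeModel {
    enum Mode { AUTO_FLUSH_SYNC, AUTO_FLUSH_BACKGROUND, MANUAL_FLUSH }

    private final Mode mode;
    private final int bufferLimit;                          // hypothetical capacity, in rows
    private final List<String> buffer = new ArrayList<>();  // rows held on the client
    private final List<String> server = new ArrayList<>();  // rows the "server" has received

    FlushModeModel(Mode mode, int bufferLimit) {
        this.mode = mode;
        this.bufferLimit = bufferLimit;
    }

    /** Models apply(): what happens to a row depends on the flush mode. */
    void apply(String row) {
        switch (mode) {
            case AUTO_FLUSH_SYNC:
                server.add(row);                 // sent before apply() returns, no batching
                break;
            case AUTO_FLUSH_BACKGROUND:
                if (buffer.size() >= bufferLimit) {
                    backgroundFlush();           // stands in for "block until space frees up"
                }
                buffer.add(row);
                break;
            case MANUAL_FLUSH:
                if (buffer.size() >= bufferLimit) {
                    throw new IllegalStateException("buffer full, call flush() first");
                }
                buffer.add(row);
                break;
        }
    }

    /** Models the background sender used by AUTO_FLUSH_BACKGROUND. */
    void backgroundFlush() {
        flush();
    }

    /** Models flush(): sends everything buffered; nothing to do in AUTO_FLUSH_SYNC. */
    void flush() {
        server.addAll(buffer);
        buffer.clear();
    }

    int rowsOnServer() { return server.size(); }

    int rowsBuffered() { return buffer.size(); }
}
```

In this model, a MANUAL_FLUSH instance keeps every row in the client-side buffer until flush() is called, which is exactly the window in which a crash can lose data.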

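In my experience the third mode is the most efficient of the three, but it still has a problem:

When I use Flink for real-time processing, data can be lost.

For example, I use Flink to consume data from Kafka and insert it into Kudu in real time, using MANUAL_FLUSH. I call session.flush() whenever 10 rows have accumulated in the buffer, which sends them to the Kudu server. Suppose the client has buffered 9 rows, one short of the threshold, when a power failure or some other incident stops the job. Those 9 rows were never flushed, and when I restart the job (the Flink program has checkpointing enabled) they are not inserted into the database. I have not yet found a way to handle this, and I would welcome discussion.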
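One approach that is commonly used for buffering sinks, offered here only as a sketch and not as the author's solution, is to flush the buffer every time a checkpoint is taken, so that a checkpointed Kafka offset never points past rows that are still sitting in memory. In Flink this would typically mean implementing CheckpointedFunction and calling session.flush() inside snapshotState(). The simulation below uses no Flink or Kudu classes; the class CheckpointFlushModel and all of its parameters are hypothetical, and a Set stands in for a table written with idempotent upserts.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Toy simulation of a buffering sink with checkpoints and one crash. */
final class CheckpointFlushModel {

    /**
     * Feeds records 0..total-1 through a sink that flushes every batchSize records.
     * A checkpoint is taken every checkpointEvery records, and the job crashes once
     * after crashAfter records, then restarts from the last checkpointed offset.
     * Returns the records that reached the "database".
     */
    static Set<Integer> run(int total, int batchSize, int checkpointEvery,
                            int crashAfter, boolean flushOnCheckpoint) {
        Set<Integer> database = new LinkedHashSet<>(); // a set models idempotent upserts
        List<Integer> buffer = new ArrayList<>();      // client-side buffer, lost on crash
        int checkpointOffset = 0;                      // next offset to read after a restart
        boolean crashed = false;
        int offset = 0;

        while (offset < total) {
            buffer.add(offset);
            offset++;
            if (buffer.size() >= batchSize) {          // size-based flush, as in the sink
                database.addAll(buffer);
                buffer.clear();
            }
            if (offset % checkpointEvery == 0) {       // checkpoint records the source offset
                if (flushOnCheckpoint) {
                    database.addAll(buffer);
                    buffer.clear();
                }
                checkpointOffset = offset;
            }
            if (!crashed && offset == crashAfter) {    // simulated power failure
                crashed = true;
                buffer.clear();                        // unflushed rows are gone
                offset = checkpointOffset;             // replay from the last checkpoint
            }
        }
        database.addAll(buffer);                       // final flush on clean shutdown
        return database;
    }
}
```

With total = 30, batchSize = 10, checkpointEvery = 9 and crashAfter = 9, the run without a flush on checkpoint loses the first 9 records, which mirrors the scenario described above, while the run with the flush keeps all 30. Because records between the last checkpoint and the crash are replayed, this gives at-least-once delivery, so writing with Kudu's Upsert operation instead of Insert is one way to keep replays harmless.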

Here is the KuduSink code I wrote at the time:

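import org.apache.flink.api.common.accumulators.IntCounter
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction
import org.apache.kudu.client.{Insert, KuduClient, KuduSession, KuduTable, SessionConfiguration}
import org.slf4j.LoggerFactory

// KUDU_MASTER, KUDU_TABLE_MK_DataDictionary and PATH_Qqwry_dat are configuration
// constants defined elsewhere in the project.
object KuduSink extends RichSinkFunction[(String, String, String, String, String, String, String, String)] {
  private val logger = LoggerFactory.getLogger(KuduSink.getClass)
  var client: KuduClient = null
  var session: KuduSession = null
  var table: KuduTable = null
  // Counter that tracks how many rows are currently buffered
  var OPERATION_BATCH: IntCounter = null
  private lazy val PATH = PATH_Qqwry_dat.stringValue

  /**
   * Business logic (omitted here).
   *
   * @param value the tuple (preEvent, preIp, preOs, preOsVersion, preLib, preBrowser, preBrowserVersion, preProject)
   */
  override def invoke(value: (String, String, String, String, String, String, String, String)): Unit = {

  }

  def insertDB(insert: Insert): Unit = {
    // In MANUAL_FLUSH mode apply() only buffers the row; nothing is sent yet
    session.apply(insert)
    this.OPERATION_BATCH.add(1)
    val num = this.OPERATION_BATCH.getLocalValue
    // Once 10 rows are buffered, flush them to the Kudu server
    if (num >= 10) {
      val responses = session.flush()
      this.OPERATION_BATCH.resetLocal()
      // Check that the rows were written successfully. (The original re-checked the
      // counter right after resetting it, so that branch could never run.)
      val it = responses.iterator()
      while (it.hasNext) {
        val response = it.next()
        if (response.hasRowError) {
          logger.error("Kudu row error: {}", response.getRowError)
        }
      }
    }
  }

  /**
   * Create the Kudu connection.
   *
   * @param parameters the Flink configuration
   */
  override def open(parameters: Configuration): Unit = {
    client = new KuduClient.KuduClientBuilder(KUDU_MASTER.stringValue).build()
    session = client.newSession()
    table = client.openTable(KUDU_TABLE_MK_DataDictionary.stringValue)
    val mode = SessionConfiguration.FlushMode.MANUAL_FLUSH
    session.setFlushMode(mode)
    OPERATION_BATCH = new IntCounter()
    //getRuntimeContext().addAccumulator("operationBatch",this.OPERATION_BATCH)
  }

  /**
   * Close the Kudu connection.
   */
  override def close(): Unit = {
    if (session != null) {
      // close() flushes any rows still in the buffer before closing the session
      session.close()
    }
    if (client != null) {
      client.close()
    }
  }
}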

~~For the full code, see my personal Git repository: https://github.com/seniscz/stream/blob/master/flink/flinkexample/flinkexample-parent/flink-db/src/main/scala/com/cz/datadictionary/KuduSink.scala~~
