kuduwriter: speeding up writes to Kudu

cclovezbf

Article link: https://blog.csdn.net/cclovezbf/article/details/126117628

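There is no way around it: once the basic functionality of anything is done, optimization becomes the main task.

I recently built a kudu->kudu data transfer job. I wrote the kudureader myself and used the poor writer that ships with DataX, and I did not expect it to be this slow: 2000+ records/s.

So, a quick look at the log first.

WaitWriterTime 7873s WaitReaderTime 56s. My reader is clearly fine, so the problem is on the write side.

I went straight to the KuduWriterTask code. I no longer remember whether this code came from DataX or was written by a colleague.

Three parameters stand out, although other parameters certainly play a part as well.

Let's start with two of the three.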

session.setFlushMode(SessionConfiguration.FlushMode.MANUAL_FLUSH);
session.setMutationBufferSpace((int) mutationBufferSpace);

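The first is flushMode.

A quick web search turns up these same settings everywhere, but where do they come from?

Apache Kudu - Using Apache Kudu with Apache Impala

The last section of that page recommends using the C++ or Java API for inserts.

Back to the question of where these two parameters come from: I have not found them in the official documentation yet, but I did find them in the source code on GitHub.

Kudu's three write modes (Flush Mode)

The description of the three modes below borrows from the article above.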

AUTO_FLUSH_SYNC

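Each session.apply() flushes the data to the server before it returns. The catch is that if the first apply() blocks or is slow to fail, the second apply() is held up behind it. The name already says it: sync means synchronous, and synchronous is generally slower.

In this mode you do not need session.flush() either; calling it has no effect.

Characteristics: good latency, poor throughput. Note that the description of this mode never mentions a buffer, which shows it does not need one.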

AUTO_FLUSH_BACKGROUND

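Both this mode and the previous one have AUTO in the name, and they are partly alike and partly different. The similarity is that you never have to call flush() yourself. The difference is that here apply() returns to you right away, before the data has been flushed to the server, which changes what you can learn from this call:

OperationResponse response = session.apply(insert);

With AUTO_FLUSH_SYNC, every apply() gives you a response that tells you whether the operation succeeded or produced errors.

AUTO_FLUSH_BACKGROUND cannot do that. If a write failed, the only way to get the error information is RowErrorsAndOverflowStatus pendingErrors = session.getPendingErrors();

See the official example.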

MANUAL_FLUSH

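This mode sets up a buffer, which is where the second parameter (the buffer size) comes in: data is first held in memory and only later sent to the server. Note that the number of buffered operations must not exceed the configured BufferSpace.

In other words, once FlushMode.MANUAL_FLUSH is set,

session.apply() puts the data into the buffer (how much goes in between flushes is the third parameter, batchSize),

and we must then call session.flush() before the data is actually sent to the server.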
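The interaction between batchSize and the buffer limit can be sketched with a small stand-in for the session. This is a toy model, not Kudu client code: ManualFlushModel, writeAll and serverRowCount are hypothetical names, and the model throws IllegalStateException where the real client raises its own exception type. It only illustrates the rule described above: apply() buffers, flush() sends, and a batchSize larger than the buffer limit makes apply() fail.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a MANUAL_FLUSH session. Illustrative only: not Kudu client code.
class ManualFlushModel {
    private final int bufferSpace;                           // role of setMutationBufferSpace(n)
    private final List<String> buffer = new ArrayList<>();   // operations applied but not yet sent
    private final List<String> server = new ArrayList<>();   // rows that "reached the server"

    ManualFlushModel(int bufferSpace) {
        this.bufferSpace = bufferSpace;
    }

    // apply() only buffers; it fails once the buffer already holds bufferSpace operations.
    void apply(String row) {
        if (buffer.size() >= bufferSpace) {
            throw new IllegalStateException("MANUAL_FLUSH is enabled but the buffer is too big");
        }
        buffer.add(row);
    }

    // flush() is the only point at which buffered rows are sent. Returns how many were sent.
    int flush() {
        int sent = buffer.size();
        server.addAll(buffer);
        buffer.clear();
        return sent;
    }

    int serverRowCount() {
        return server.size();
    }

    // Writer loop: apply every row, flush once per batchSize rows, then flush the tail.
    static int writeAll(ManualFlushModel session, int rows, int batchSize) {
        int flushes = 0;
        for (int i = 1; i <= rows; i++) {
            session.apply("row-" + i);
            if (i % batchSize == 0) {
                session.flush();
                flushes++;
            }
        }
        if (session.flush() > 0) {
            flushes++;
        }
        return flushes;
    }

    public static void main(String[] args) {
        ManualFlushModel session = new ManualFlushModel(10000);
        System.out.println("flushes: " + writeAll(session, 2500, 1000));
        System.out.println("rows on server: " + session.serverRowCount());
    }
}
```

With a buffer limit of 10000 and a batchSize of 1000 the loop never hits the limit. Calling writeAll with a batchSize above the limit fails on the first apply() after the buffer fills, which is why batchSize has to stay at or below the buffer setting.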

——————————————————————————————————————————

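The three descriptions above are the official wording. How do we understand them at a deeper level? Read the code.

Specifically, look at the AsyncKuduSession code.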

if (this.flushMode == FlushMode.AUTO_FLUSH_SYNC) {
    return this.doAutoFlushSync(operation);
}

AUTO_FLUSH_SYNC is short and clear: it sends the operation directly and then collects the return value.

int activeBufferSize = this.activeBuffer.getOperations().size(); // number of operations buffered but not yet sent
case MANUAL_FLUSH:
    if (activeBufferSize >= this.mutationBufferMaxOps) {
        statusServiceUnavailable = Status.IllegalState("MANUAL_FLUSH is enabled but the buffer is too big");
        throw new NonRecoverableException(statusServiceUnavailable); // buffer is full: the apply fails
    }

    this.activeBuffer.getOperations().add(new AsyncKuduSession.BufferedOperation(tablet, operation)); // append the operation to the list
    break;

This compares activeBufferSize with the mutationBufferMaxOps value we configured.

What is activeBufferSize? Every operation you apply is appended to a list, and activeBufferSize is the current size of that list.

case AUTO_FLUSH_BACKGROUND:
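    if (activeBufferSize >= this.mutationBufferMaxOps) { // buffered operations have reached the configured limit
        fullBuffer = this.retireActiveBufferUnlocked(); // hand the full buffer over to another list so a cleared buffer can keep accepting operations
        activeBufferSize = 0; // the buffer was just emptied, so the count of waiting operations is 0 as well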
        if (!this.inactiveBufferAvailable()) {
            statusServiceUnavailable = Status.ServiceUnavailable("All buffers are currently flushing");
            throw new PleaseThrottleException(statusServiceUnavailable, (KuduException)null, operation, notification);
        }

        this.refreshActiveBufferUnlocked();
    }

    this.activeBuffer.getOperations().add(new AsyncKuduSession.BufferedOperation(tablet, operation));
    if (activeBufferSize == 0) {
        AsyncKuduClient.newTimeout(this.client.getTimer(), this.activeBuffer.getFlusherTask(), (long)this.flushIntervalMillis);
    }

    if (activeBufferSize + 1 >= this.mutationBufferMaxOps && this.inactiveBufferAvailable()) {
        fullBuffer = this.retireActiveBufferUnlocked();
    }

 finally {
    this.doFlush(fullBuffer);
}

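(long)this.flushIntervalMillis defaults to 1000 ms. As the name suggests, this mode flushes on a timer, much like Kafka's producer, which also flushes on batch size or linger time (I have half forgotten the details).
Note the flush at the end: fullBuffer is only ever set in the background mode. I won't go into more detail and will borrow this author's diagram instead.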
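The size-triggered part of the AUTO_FLUSH_BACKGROUND branch can be pictured with a small model. This is a simplified sketch, not the Kudu implementation: BackgroundFlushModel and its methods are hypothetical names, and it deliberately leaves out the flushIntervalMillis timer, the PleaseThrottleException path and any network I/O. It keeps only the idea that a full active buffer is retired for sending while a fresh buffer takes over.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of size-triggered buffer retirement. Illustrative only: not Kudu client code.
class BackgroundFlushModel {
    private final int maxOps;                                      // role of mutationBufferMaxOps
    private List<String> active = new ArrayList<>();               // buffer currently accepting operations
    private final List<List<String>> flushed = new ArrayList<>();  // buffers handed off for sending

    BackgroundFlushModel(int maxOps) {
        this.maxOps = maxOps;
    }

    // Buffer the operation; once the active buffer is full, retire it and start a fresh one.
    void apply(String op) {
        active.add(op);
        if (active.size() >= maxOps) {
            flushed.add(active);
            active = new ArrayList<>();
        }
    }

    // Operations still waiting in the active buffer.
    int pending() {
        return active.size();
    }

    // Sizes of the buffers that were retired, in order.
    List<Integer> flushedBatchSizes() {
        List<Integer> sizes = new ArrayList<>();
        for (List<String> batch : flushed) {
            sizes.add(batch.size());
        }
        return sizes;
    }

    public static void main(String[] args) {
        BackgroundFlushModel session = new BackgroundFlushModel(3);
        for (int i = 1; i <= 7; i++) {
            session.apply("op-" + i);
        }
        // prints "[3, 3] pending=1"
        System.out.println(session.flushedBatchSizes() + " pending=" + session.pending());
    }
}
```

In the model the seventh operation simply stays pending. In the code quoted above, that leftover is what the timer task scheduled with flushIntervalMillis is there to pick up.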

kudu/SessionConfiguration.java at master · apache/kudu · GitHub

package org.apache.kudu.client;

import org.apache.yetus.audience.InterfaceAudience;
import org.apache.yetus.audience.InterfaceStability;

/**
 * Interface that defines the methods used to configure a session. It also exposes ways to
 * query its state.
    (Author's note: this interface holds all of the session's set/get configuration methods.)
 */
@InterfaceAudience.Public
@InterfaceStability.Evolving
public interface SessionConfiguration {

  @InterfaceAudience.Public
  @InterfaceStability.Evolving
  enum FlushMode {
    /**
     * Each {@link KuduSession#apply KuduSession.apply()} call will return only after being
     * flushed to the server automatically. No batching will occur.
     *
     * <p>In this mode, the {@link KuduSession#flush} call never has any effect, since each
     * {@link KuduSession#apply KuduSession.apply()} has already flushed the buffer before
     * returning.
     *
     * <p><strong>This is the default flush mode.</strong>
     */
    AUTO_FLUSH_SYNC,

    /**
     * {@link KuduSession#apply KuduSession.apply()} calls will return immediately, but the writes
     * will be sent in the background, potentially batched together with other writes from
     * the same session. If there is not sufficient buffer space, then
     * {@link KuduSession#apply KuduSession.apply()} may block for buffer space to be available.
     *
     * <p>Because writes are applied in the background, any errors will be stored
     * in a session-local buffer. Call {@link #countPendingErrors() countPendingErrors()} or
     * {@link #getPendingErrors() getPendingErrors()} to retrieve them.
     *
     * <p><strong>Note:</strong> The {@code AUTO_FLUSH_BACKGROUND} mode may result in
     * out-of-order writes to Kudu. This is because in this mode multiple write
     * operations may be sent to the server in parallel.
     * See <a href="https://issues.apache.org/jira/browse/KUDU-1767">KUDU-1767</a> for more
     * information.
     *
     * <p>The {@link KuduSession#flush()} call can be used to block until the buffer is empty.
     */
    AUTO_FLUSH_BACKGROUND,

    /**
     * {@link KuduSession#apply KuduSession.apply()} calls will return immediately, but the writes
     * will not be sent until the user calls {@link KuduSession#flush()}. If the buffer runs past
     * the configured space limit, then {@link KuduSession#apply KuduSession.apply()} will return
     * an error.
     */
    MANUAL_FLUSH
  }

  /**
   * Get the current flush mode.
   * @return flush mode, {@link FlushMode#AUTO_FLUSH_SYNC AUTO_FLUSH_SYNC} by default
   */
  FlushMode getFlushMode();

  /**
   * Set the new flush mode for this session.
   * @param flushMode new flush mode, can be the same as the previous one.
   * @throws IllegalArgumentException if the buffer isn't empty.
   */
  void setFlushMode(FlushMode flushMode);

  /**
   * Set the number of operations that can be buffered.
   * @param size number of ops.
   * @throws IllegalArgumentException if the buffer isn't empty.
   */
  void setMutationBufferSpace(int size);

  /**
   * Set the low watermark for this session. The default is set to half the mutation buffer space.
   * For example, a buffer space of 1000 with a low watermark set to 50% (0.5) will start randomly
   * sending PleaseRetryExceptions once there's an outstanding flush and the buffer is over 500.
   * As the buffer gets fuller, it becomes likelier to hit the exception.
   * @param mutationBufferLowWatermarkPercentage a new low watermark as a percentage,
   *                             has to be between 0  and 1 (inclusive). A value of 1 disables
   *                             the low watermark since it's the same as the high one
   * @throws IllegalArgumentException if the buffer isn't empty or if the watermark isn't between
   * 0 and 1
   * @deprecated The low watermark no longer has any effect.
   */
  @Deprecated
  void setMutationBufferLowWatermark(float mutationBufferLowWatermarkPercentage);

  /**
   * Set the flush interval, which will be used for the next scheduling decision.
   * @param interval interval in milliseconds.
   */
  void setFlushInterval(int interval);

  /**
   * Get the current timeout.
   * @return operation timeout in milliseconds, 0 if none was configured.
   */
  long getTimeoutMillis();

  /**
   * Sets the timeout for the next applied operations.
   * The default timeout is 0, which disables the timeout functionality.
   * @param timeout Timeout in milliseconds.
   */
  void setTimeoutMillis(long timeout);

  /**
   * Returns true if this session has already been closed.
   */
  boolean isClosed();

  /**
   * Check if there are operations that haven't been completely applied.
   * @return true if operations are pending, else false.
   */
  boolean hasPendingOperations();

  /**
   * Set the new external consistency mode for this session.
   * @param consistencyMode new external consistency mode, can the same as the previous one.
   * @throws IllegalArgumentException if the buffer isn't empty.
   */
  void setExternalConsistencyMode(ExternalConsistencyMode consistencyMode);

  /**
   * Tells if the session is currently ignoring row errors when the whole list returned by a tablet
   * server is of the AlreadyPresent type.
   * @return true if the session is enforcing this, else false
   */
  boolean isIgnoreAllDuplicateRows();

  /**
   * Configures the option to ignore all the row errors if they are all of the AlreadyPresent type.
   * This can be useful when it is possible for INSERT operations to be retried and fail.
   * The effect of enabling this is that operation responses that match this pattern will be
   * cleared of their row errors, meaning that we consider them successful.
   *
   * TODO(KUDU-1563): Implement server side ignore capabilities to improve performance and
   *  reliability of INSERT ignore operations.
   *
   * <p>Disabled by default.
   * @param ignoreAllDuplicateRows true if this session should enforce this, else false
   */
  void setIgnoreAllDuplicateRows(boolean ignoreAllDuplicateRows);

  /**
   * Tells if the session is currently ignoring row errors when the whole list returned by a tablet
   * server is of the NotFound type.
   * @return true if the session is enforcing this, else false
   */
  boolean isIgnoreAllNotFoundRows();

  /**
   * Configures the option to ignore all the row errors if they are all of the NotFound type.
   * This can be useful when it is possible for DELETE operations to be retried and fail.
   * The effect of enabling this is that operation responses that match this pattern will be
   * cleared of their row errors, meaning that we consider them successful.
   *
   * TODO(KUDU-1563): Implement server side ignore capabilities to improve performance and
   *  reliability of DELETE ignore operations.
   *
   * <p>Disabled by default.
   * @param ignoreAllNotFoundRows true if this session should enforce this, else false
   */
  void setIgnoreAllNotFoundRows(boolean ignoreAllNotFoundRows);

  /**
   * Set the number of errors that can be collected.
   * @param size number of errors.
   */
  void setErrorCollectorSpace(int size);

  /**
   * Return the number of errors which are pending. Errors may accumulate when
   * using {@link FlushMode#AUTO_FLUSH_BACKGROUND AUTO_FLUSH_BACKGROUND} mode.
   * @return a count of errors
   */
  int countPendingErrors();

  /**
   * Return any errors from previous calls. If there were more errors
   * than could be held in the session's error storage, the overflow state is set to true.
   *
   * <p>Clears the pending errors.
   * @return an object that contains the errors and the overflow status
   */
  RowErrorsAndOverflowStatus getPendingErrors();

  /**
   * Return cumulative write operation metrics since the beginning of the session.
   * @return cumulative write operation metrics since the beginning of the session.
   */
  ResourceMetrics getWriteOpMetrics();
}

First experiment

"batchSize": "10",
"bufferSize": "10000",

1000+ records/s: the worst of the bad.

Second experiment

"batchSize": "100",
"bufferSize": "10000"

5000+ records/s: still bad.

Third experiment

"batchSize": "1000",
"bufferSize": "10000"

7800+ records/s: still bad.

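Then something odd happened.

No matter how large I made batchSize (2000, 5000, 10000), the speed stayed at around 7000+ records/s, and frankly I cannot accept that speed.

At this point some will suspect a Kudu bottleneck: is Kudu itself simply not fast enough?

That is easy to test. Start two DataX jobs, both oracle->kudu into this same table (re-inserting duplicate rows into Kudu does not change the data). If starting the second job slows down the first, then Kudu's insert ceiling really is about 8000 r/s. The test showed both jobs running at 7000 r/s each, so the cluster can absorb more than a single job delivers. Where is the problem, then?

My first guess is still something on the Kudu side.

Another author's article mentions this parameter: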
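One possible reading of the plateau is that every flush pays a fixed cost plus a cost per row. The sketch below fits that assumed model to the first two measurements (batchSize 10 at 1000 records/s and batchSize 100 at 5000 records/s) and extrapolates. The fixed-plus-linear form and the BatchCostModel class are assumptions made for illustration, and the inputs are the rounded "1000+" and "5000+" figures, so the output is only a rough estimate.

```java
// Back-of-the-envelope cost model. The fixed-plus-linear form is an assumption, not a measurement.
class BatchCostModel {
    // Fit secondsPerFlush = fixed + perRow * batchSize through two (batchSize, rows/s) points.
    // Returns {fixed, perRow}.
    static double[] fit(int b1, double rate1, int b2, double rate2) {
        double t1 = b1 / rate1;                  // seconds per flush at the first point
        double t2 = b2 / rate2;                  // seconds per flush at the second point
        double perRow = (t2 - t1) / (b2 - b1);   // marginal seconds per extra row
        double fixed = t1 - perRow * b1;         // seconds paid per flush regardless of size
        return new double[] {fixed, perRow};
    }

    // Rows per second the fitted model predicts for a given batchSize.
    static double predictRate(double[] model, int batchSize) {
        return batchSize / (model[0] + model[1] * batchSize);
    }

    public static void main(String[] args) {
        double[] model = fit(10, 1000, 100, 5000);
        System.out.printf("batchSize=1000  -> about %.0f rows/s%n", predictRate(model, 1000));
        System.out.printf("batchSize=10000 -> about %.0f rows/s%n", predictRate(model, 10000));
        System.out.printf("ceiling         -> about %.0f rows/s%n", 1 / model[1]);
    }
}
```

Under this model the rate at batchSize 1000 comes out near 8300 rows/s and can never exceed roughly 9000 rows/s however large the batch is. That has the same shape as the measured 7800 and the 7000+ plateau: once the fixed cost is amortized, the per-row cost dominates and a larger batchSize stops helping.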

maintenance_manager_num_threads

The number of threads devoted to background maintenance operations such as flushes and compactions. If the tablet server appears to be falling behind on write operations (inserts, updates, and deletes) but CPU and disk resources are not saturated, increasing this thread count will devote more resources to these background operations.

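Note that flushes are mentioned here, which suggests that raising this parameter should help. The cluster cannot be stopped casually, though, and the demand for write throughput into this Kudu table is not high, so I am leaving this untested for now.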
