Kudu java 客户端Row 读写数据结构设计

最新推荐文章于 2022-07-22 11:19:20 发布

彼岸枫雪非

最新推荐文章于 2022-07-22 11:19:20 发布

阅读量931

点赞数

分类专栏： kudu 文章标签： kudu

本文链接：https://blog.csdn.net/u012543819/article/details/105494811

版权

kudu 专栏收录该内容

1 篇文章 0 订阅

订阅专栏

刚开始工作时有个研发大哥告诉，看源码时，首先要了解各部分的数据结构设计，搞清楚数据结构的设计对于理解代码很有帮助，我一直将这句话记在心里。此前因为工作需要，需要对kudu-client 的部分代码进行修改，以实现数据加解密。因此，我仔细研究了一下kudu java client 中对于kudu row的数据结构设计和数据的读写流程，这里我将row的设计思路捋一下，并做一个笔记。

数据写入–PartialRow

Kudu clien中对于一行数据（Row）的数据结构设计如下：
PartialRow 类图
结构整体是比较简单的：
schama ：保存了字段的信息
varLengthData：用于存储变长数据，如string， binary这种类型数据
rowAlloc：用于存储数值类型这种定长数据
两个bitset用于记录字段是否set了值， froze用于控制row是否可修改
上述数据结构初始化工作如下：

/**
   * This is not a stable API, prefer using {@link Schema#newPartialRow()}
   * to create a new partial row.
   * @param schema the schema to use for this row
   */
  public PartialRow(Schema schema) {
    this.schema = schema;
    this.columnsBitSet = new BitSet(this.schema.getColumnCount());
    this.nullsBitSet = schema.hasNullableColumns() ?
        new BitSet(this.schema.getColumnCount()) : null;
    this.rowAlloc = new byte[schema.getRowSize()];
    // Pre-fill the array with nulls. We'll only replace cells that have varlen values.
    this.varLengthData = Arrays.asList(new ByteBuffer[this.schema.getColumnCount()]);
  }

上面对varLengthData 进行了初始化，并且计算了rowAlloc 的size，这里我们重点看下如何计算rowAlloc 数据大小的：

/**
   * Gives the size in bytes for a single row given the specified schema
   * @param columns the row's columns
   * @return row size in bytes
   */
  private  int getRowSize(List<ColumnSchema> columns) {
    int totalSize = 0;
    boolean hasNullables = false;
    Set<Integer> encryptedColIndices = getEncryptedColumnIdsToAlgortithms().keySet();
    int size = columns.size();
    for (int i = 0; i < size; i++) {
      totalSize += columns.get(i).getTypeSize(encryptedColIndices.contains(i));
      hasNullables |= columns.get(i).isNullable();
    }
    if (hasNullables) {
      totalSize += Bytes.getBitSetSize(columns.size());
    }
    return totalSize;
  }

上面的代码对每个字段进行遍历，计算数据长度。我们再看下每个数据类型到底占了多大的长度：

/**
   * Gives the size in bytes for a given DataType, as per the pb specification
   * @param type pb type
   * @return size in bytes
   */
  private static int getTypeSize(DataType type) {
    switch (type) {
      case STRING:
      case BINARY:
        return 8 + 8; // offset then string length
      case BOOL:
      case INT8:
      case IS_DELETED:
        return 1;
      case INT16:
        return Shorts.BYTES;
      case INT32:
      case FLOAT:
        return Ints.BYTES;
      case INT64:
      case DOUBLE:
      case UNIXTIME_MICROS:
        return Longs.BYTES;
      default: throw new IllegalArgumentException("The provided data type doesn't map" +
          " to know any known one.");
    }
  }

这里给string 和binary分配了16个字节用于记录数据的offset和数据的实际长度（8+8），其余的数据根据当前的情况分配固定的长度。

PartialRow 数据写入

这里写入就是调用其addXX方法，我们以分别为定长数据和变长数据举一个set数据的例子。
定长数据以int数据为例：

/**
   * Add an int for the specified column.
   * @param columnIndex the column's index in the schema
   * @param val value to add
   * @throws IllegalArgumentException if the value doesn't match the column's type
   * @throws IllegalStateException if the row was already applied
   * @throws IndexOutOfBoundsException if the column doesn't exist
   */
  public void addInt(int columnIndex, int val) {
    checkNotFrozen();
    checkColumn(schema.getColumnByIndex(columnIndex), Type.INT32);
    Bytes.setInt(rowAlloc, val, getPositionInRowAllocAndSetBitSet(columnIndex));

这里非常简单，就是直接把值set到rowAlloc 字节数组里面，并且将改字段标记为已set值，重要的是如何拿到数据应该存放的位置，即getPositionInRowAllocAndSetBitSet

 /**
   * Sets the column bit set for the column index, and returns the column's offset.
   * @param columnIndex the index of the column to get the position for and mark as set
   * @return the offset in rowAlloc for the column
   */
  private int getPositionInRowAllocAndSetBitSet(int columnIndex) {
    columnsBitSet.set(columnIndex);
    return schema.getColumnOffset(columnIndex);
  }

变长,以string数据为例，非常简单，直接放到addVarLengthData里面就可以了。

 /**
   * Add a String for the specified value, encoded as UTF8.
   * Note that the provided value must not be mutated after this.
   * @param columnIndex the column's index in the schema
   * @param val value to add
   * @throws IllegalArgumentException if the value doesn't match the column's type
   * @throws IllegalStateException if the row was already applied
   * @throws IndexOutOfBoundsException if the column doesn't exist
   */
  public void addStringUtf8(int columnIndex, byte[] val) {
    // TODO: use Utf8.isWellFormed from Guava 16 to verify that.
    // the user isn't putting in any garbage data.
    checkNotFrozen();
    checkColumn(schema.getColumnByIndex(columnIndex), Type.STRING);
    addVarLengthData(columnIndex, val);
  }

数据读取–RowResult

注释中写道：RowResult represents one row from a scanner. 也就是代表读取kudu数据后返回的行。其结构如下：
RowResult类图
其中， rowData存储定长数据，indirectData存储string和binary的变长数据。我们以long和string为例看看如何读取数据：
之所以要以long类型举例是因为这里有个特殊的地方, 还记得PartialRow中在rowAlloc中也为string和binary申请了空间，这个空间是用于存储变长数据的offset和length的，一共16个字节，8字节用于offset，8字节用于length。

 /**
   * Get the specified column's long
   *
   * If this is a UNIXTIME_MICROS column, the long value corresponds to a number of microseconds
   * since midnight, January 1, 1970 UTC.
   *
   * @param columnIndex Column index in the schema
   * @return a positive long
   * @throws IllegalArgumentException if the column is null
   * @throws IndexOutOfBoundsException if the column doesn't exist
   */
  public long getLong(int columnIndex) {
    checkValidColumn(columnIndex);
    checkNull(columnIndex);
    checkType(columnIndex, Type.INT64, Type.UNIXTIME_MICROS);
    return getLongOrOffset(columnIndex);
  }

  /**
   * Returns the long column value if the column type is INT64 or UNIXTIME_MICROS.
   * Returns the column's offset into the indirectData if the column type is BINARY or STRING.
   * @param columnIndex Column index in the schema
   * @return a long value for the column
   */
  long getLongOrOffset(int columnIndex) {
    return Bytes.getLong(this.rowData.getRawArray(),
        this.rowData.getRawOffset() +
            getCurrentRowDataOffsetForColumn(columnIndex));
  }

对于string数据，先是获取了offset，然后计算了再读取offset紧接着的8个字节数据长度，然后根据offset和length，以及基础的数据offset读取到数据。这里数据的最大长度为Integer.MAX_VALUE

/**
   * Get the specified column's string.
   * @param columnIndex Column index in the schema
   * @return a string
   * @throws IllegalArgumentException if the column is null
   * or if the type doesn't match the column's type
   * @throws IndexOutOfBoundsException if the column doesn't exist
   */
  public String getString(int columnIndex) {
    checkValidColumn(columnIndex);
    checkNull(columnIndex);
    checkType(columnIndex, Type.STRING);
    // C++ puts a Slice in rowData which is 16 bytes long for simplicity, but we only support ints.
    long offset = getLongOrOffset(columnIndex);
    long length = rowData.getLong(getCurrentRowDataOffsetForColumn(columnIndex) + 8);
    assert offset < Integer.MAX_VALUE;
    assert length < Integer.MAX_VALUE;
    return Bytes.getString(indirectData.getRawArray(),
                           indirectData.getRawOffset() + (int)offset,
                           (int)length);
  }

总结

以上就是kudu client中对于数据写入和读取的数据结构及读写逻辑的一个笔记。如有不当，欢迎指出。

彼岸枫雪非

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
1
评论
Kudu java 客户端Row 读写数据结构设计

Kudu java 客户端Partial Row数据结构设计文章目录Row 设计--PartialRowRow 数据写入--addXXX()数据读取--RowResult总结刚开始工作时有个研发大哥告诉，看源码时，首先要了解各部分的数据结构设计，搞清楚数据结构的设计对于理解代码很有帮助，我一直将这句话记在心里。此前因为工作需要，需要对kudu-client 的部分代码进行修改，以实现数据加解密。...
复制链接

扫一扫