An Arrow IPC Communication Bug Caused by a Null

Article link: https://blog.csdn.net/guangcheng0312q/article/details/129292691

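0. Introduction

While testing some code today, I happened to hit an interesting case.

Suppose table t2 has two columns, i and j. A query whose select list references both i and j works fine, but a query that references only one of the columns fails.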

select count(i) from t2;

This query fails with:

[record-batch-reader][read-next]: IOError: buffer_index out of range.

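This error comes from Arrow's reader interface. To fix it I needed a clear picture of Arrow's IPC mechanism, so there was nothing for it but to dig into the source code.

1. A brief look at how Arrow IPC works

This section covers only the points that matter for this bug. I have not finished reading the full IPC implementation, so I will leave the details for later, when I have had time to read more.

1.1 Sender side

Arrow has two relevant files: writer.cc (the sender side) and reader.cc (the receiver side).

The core of writer.cc is the IpcPayload struct: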

struct IpcPayload {
  MessageType type = MessageType::NONE;
  std::shared_ptr<Buffer> metadata;
  std::vector<std::shared_ptr<Buffer>> body_buffers;
  int64_t body_length = 0;
};

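The member of this struct that relates to our data is body_buffers.

For example, suppose the scan produces two columns, i and j. A query such as select count(i+j) references both columns, so the scan output is complete and both i and j carry data. In our example, select count(i), only column i carries data. The scan still returns two columns, but the second column, j, is filled with a null array.

When this data is sent over IPC, the body_buffers of the IpcPayload become: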

[null_bitmap_i, data_i]

If both i and j carry data, they are:

[null_bitmap_i, data_i, null_bitmap_j, data_j]

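In short, a null array does not add any buffers to body_buffers.

Serializing the data is also part of this path, but I will not go into it here; see the source code for details.

1.2 Receiver side

The receiver, in short, reads a record batch and deserializes it. Arrow uses the open-source FlatBuffers serialization library for this, and the corresponding schema file is Message.fbs.

A quick look: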
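The buffer layout above can be mimicked with a small toy model. The sketch below is not Arrow code: ToyType, ToyColumn and CollectBodyBuffers are hypothetical names invented for illustration, and each buffer is represented only by its label. It assumes, as described above, that a primitive column contributes a validity bitmap plus a data buffer, and that a null-typed column contributes nothing.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-ins for illustration only; these are not Arrow types.
enum class ToyType { Int32, Null };

struct ToyColumn {
  std::string name;
  ToyType type;
};

// Builds the list of body buffer labels for a batch. A primitive column adds
// a validity bitmap and a data buffer; a null-typed column adds nothing.
std::vector<std::string> CollectBodyBuffers(const std::vector<ToyColumn>& columns) {
  std::vector<std::string> body_buffers;
  for (const ToyColumn& column : columns) {
    if (column.type == ToyType::Null) {
      continue;  // null array: no buffers go into the payload
    }
    body_buffers.push_back("null_bitmap_" + column.name);
    body_buffers.push_back("data_" + column.name);
  }
  return body_buffers;
}
```

Called with columns i (Int32) and j (Null), this returns only the two labels for i. With both columns typed Int32 it returns four labels, matching the two layouts listed above.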

/// A data header describing the shared memory layout of a "record" or "row"
/// batch. Some systems call this a "row batch" internally and others a "record
/// batch".
table RecordBatch {
  /// number of records / rows. The arrays in the batch should all have this
  /// length
  length: long;

  /// Nodes correspond to the pre-ordered flattened logical schema
  nodes: [FieldNode];

  /// Buffers correspond to the pre-ordered flattened buffer tree
  ///
  /// The number of buffers appended to this list depends on the schema. For
  /// example, most primitive arrays will have 2 buffers, 1 for the validity
  /// bitmap and 1 for the values. For struct arrays, there will only be a
  /// single buffer for the validity (nulls) bitmap
  buffers: [Buffer];

  /// Optional compression of the message body
  compression: BodyCompression;
}

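Once the schema is defined, FlatBuffers generates the corresponding code. The receiver has an ArrayLoader class, which has one particularly important member: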

const flatbuf::RecordBatch* metadata_;

This member uses the FlatBuffers-generated code mentioned above, and the place where our error is raised is tied to it. The code is:

Status GetBuffer(int buffer_index, std::shared_ptr<Buffer>* out) {
    auto buffers = metadata_->buffers();
    CHECK_FLATBUFFERS_NOT_NULL(buffers, "RecordBatch.buffers");
    if (buffer_index >= static_cast<int>(buffers->size())) {
      // here!!!
      return Status::IOError("buffer_index out of range.");
    }
    // do something
  }

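So what exactly is buffers? It corresponds to the buffers written by the sender described above. With gdb we can see that on the second pass buffer_index = 2, which is not less than buffers->size(), so the check fails. Normally there should be no second pass at all, so why does this happen?

This raises a preliminary question: why is this function reached twice?

The stack shows field-related logic, so I moved up to frame 6 (f 6 in gdb) to look at the code there.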

#0  arrow::ipc::ArrayLoader::GetBuffer(int, std::shared_ptr<arrow::Buffer>*)
    (this=0x7ffe2e39b420, buffer_index=1, out=0x8f459f0)
    at /code/arrow/cpp/src/arrow/ipc/reader.cc:191
#1  0x00007f322451db26 in arrow::ipc::ArrayLoader::LoadPrimitive<arrow::Int32Type>(arrow::Type::type) (this=0x7ffe2e39b420, type_id=arrow::Type::INT32) at /code/arrow/cpp/src/arrow/ipc/reader.cc:244
#2  0x00007f3224517545 in arrow::ipc::ArrayLoader::Visit<arrow::Int32Type>(arrow::Int32Type const&)
    (this=0x7ffe2e39b420, type=...) at /code/arrow/cpp/src/arrow/ipc/reader.cc:303
#3  0x00007f3224511be8 in arrow::VisitTypeInline<arrow::ipc::ArrayLoader>(arrow::DataType const&, arrow::ipc::ArrayLoader*) (type=..., visitor=0x7ffe2e39b420)
    at /code/arrow/cpp/src/arrow/visitor_inline.h:90
#4  0x00007f322450b509 in arrow::ipc::ArrayLoader::LoadType(arrow::DataType const&)
    (this=0x7ffe2e39b420, type=...) at /code/arrow/cpp/src/arrow/ipc/reader.cc:169
#5  0x00007f322450b5b4 in arrow::ipc::ArrayLoader::Load(arrow::Field const*, arrow::ArrayData*)
    (this=0x7ffe2e39b420, field=0x8f451c0, out=0x8f45970)
    at /code/arrow/cpp/src/arrow/ipc/reader.cc:179
#6  0x00007f32244f9909 in arrow::ipc::LoadRecordBatchSubset(org::apache::arrow::flatbuf::RecordBatch const*, std::shared_ptr<arrow::Schema> const&, std::vector<bool, std::allocator<bool> > const*, arrow::ipc::IpcReadContext const&, arrow::io::RandomAccessFile*) (metadata=0x8f45734, schema

The logic of LoadRecordBatchSubset is roughly:

for (int i = 0; i < schema->num_fields(); ++i) {
  const Field& field = *schema->field(i);
  if (!inclusion_mask || (*inclusion_mask)[i]) {
    // important!!!!
    RETURN_NOT_OK(loader.Load(&field, column.get()));  
    // do something
  }
}

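I stripped out the unimportant code and kept only the Load line. The loop reads each field from the schema, and because there are two columns, the path down to GetBuffer is taken twice. So the problem is caused either by the schema or by the logic between Load and GetBuffer.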

Status LoadType(const DataType& type) { return VisitTypeInline(type, this); }

Status Load(const Field* field, ArrayData* out) {
  if (max_recursion_depth_ <= 0) {
    return Status::Invalid("Max recursion depth reached");
  }

  field_ = field;
  out_ = out;
  out_->type = field_->type();
  return LoadType(*field_->type());
}

The trail finally leads to VisitTypeInline, which dispatches to different Visit functions, for example:

// NullType
Status Visit(const NullType& type) {
  out_->buffers.resize(1);

  // ARROW-6379: NullType has no buffers in the IPC payload
  return GetFieldMetadata(field_index_++, out_);
}

// FixedSizeBinaryType
Status Visit(const FixedSizeBinaryType& type) {
  out_->buffers.resize(2);
  RETURN_NOT_OK(LoadCommon(type.id()));
  return GetBuffer(buffer_index_++, &out_->buffers[1]);
}

template <typename T>
enable_if_t<std::is_base_of<FixedWidthType, T>::value &&
                !std::is_base_of<FixedSizeBinaryType, T>::value &&
                !std::is_base_of<DictionaryType, T>::value,
            Status>
Visit(const T& type) {
  return LoadPrimitive<T>(type.id());
}

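Which Visit is called depends on the type of each field in the record batch's schema. The expected behavior is one call to the third Visit above (for column i) and one call to the first Visit (for column j, since it is a null array). In reality the third Visit was called twice, so the GetBuffer path was entered twice and buffer_index went out of range.

So the problem is caused by the type. Going back to the schema construction logic, it turned out that the code there was indeed wrong.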
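To make the mismatch concrete, here is a toy model of the reader's walk over the schema. It is only a sketch under assumptions: ToyLoader, ToyType and the string-based error return are hypothetical simplifications, and the real ArrayLoader does considerably more (field metadata, recursion depth, nested types). What the model keeps is the single cursor buffer_index_, which every primitive column advances and a null-typed column leaves alone.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy model for illustration only; the names echo the article, but this is not Arrow code.
enum class ToyType { Int32, Null };

class ToyLoader {
 public:
  // num_buffers plays the role of buffers->size() in the batch metadata.
  explicit ToyLoader(std::size_t num_buffers) : num_buffers_(num_buffers) {}

  // Loads every field of the schema in order.
  // Returns an empty string on success, or the first error message.
  std::string LoadAll(const std::vector<ToyType>& schema) {
    for (ToyType type : schema) {
      std::string error = Load(type);
      if (!error.empty()) return error;
    }
    return "";
  }

 private:
  std::string Load(ToyType type) {
    if (type == ToyType::Null) {
      return "";  // null type: consumes no buffers, cursor stays put
    }
    // Primitive type: consume the validity bitmap, then the data buffer.
    std::string error = GetBuffer(buffer_index_++);
    if (!error.empty()) return error;
    return GetBuffer(buffer_index_++);
  }

  // Same bounds check as in the snippet quoted earlier.
  std::string GetBuffer(std::size_t buffer_index) const {
    if (buffer_index >= num_buffers_) return "buffer_index out of range.";
    return "";
  }

  std::size_t num_buffers_;
  std::size_t buffer_index_ = 0;
};
```

With a fresh ToyLoader holding two buffers, LoadAll({ToyType::Int32, ToyType::Null}) succeeds. With another fresh loader, LoadAll({ToyType::Int32, ToyType::Int32}) fails on the second column at buffer_index 2, which is the situation observed in gdb.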

garrow_store_func_ptr(datatype, PGTypeToArrow(atttypid));

As you can see, the type stored here is not the null type, so changing the line to the following fixes the problem.

garrow_store_func_ptr(datatype, (GArrowDataType *) garrow_null_data_type_new());
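The article shows only the changed line, not the code around it, so the following is a guess at the idea behind the fix rather than the actual patch. It assumes the schema builder knows which columns the scan really populates; the populated_by_scan flag, ToyAttribute, ToyType and BuildSchemaTypes are all hypothetical names. The point is that the type declared in the schema has to agree with the array that is sent: a column filled with a null array must be declared as the null type.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-ins for illustration only; these are not Arrow or Arrow GLib types.
enum class ToyType { Int32, Null };

struct ToyAttribute {
  std::string name;
  bool populated_by_scan;  // hypothetical: false when the scan fills the column with a null array
};

// Chooses the schema type for each attribute so that it matches what the
// scan sends: the real column type for populated columns, the null type otherwise.
std::vector<ToyType> BuildSchemaTypes(const std::vector<ToyAttribute>& attributes) {
  std::vector<ToyType> types;
  types.reserve(attributes.size());
  for (const ToyAttribute& attribute : attributes) {
    types.push_back(attribute.populated_by_scan ? ToyType::Int32 : ToyType::Null);
  }
  return types;
}
```

For select count(i) on t2, this toy builder would declare i as Int32 and j as Null, so the reader takes the NullType branch for j and does not look for buffers that were never sent.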

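A single line of code introduced a hard-to-find problem, and one that can only be understood by digging into Arrow's internals. It looks like I have a lot more to learn. That's all for this section.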