Iceberg Write Interfaces

1.Data File Write Call Hierarchy

  When Iceberg writes data files, the class call chain is: TaskWriter -> BaseRollingWriter -> FileWriter -> FileAppender -> Iceberg's wrapper class for the concrete file format -> the concrete file format's own implementation

2.TaskWriter

  From the earlier analysis of Flink, Iceberg writes go through TaskWriter:

public void processElement(StreamRecord<T> element) throws Exception {
  writer.write(element.getValue());
}

  TaskWriter has many implementations. We first focus on the generic ones in iceberg-core, of which there are four: BaseTaskWriter, UnpartitionedWriter, PartitionedWriter and PartitionedFanoutWriter.
  BaseTaskWriter is the base implementation and is not used directly; the other three are its subclasses. In addition, Flink defines its own subclasses.
  UnpartitionedWriter is the unpartitioned write implementation.
  PartitionedWriter and PartitionedFanoutWriter are both partitioned write implementations; each partition key maps to a lower-level object that performs the actual write. The difference is that PartitionedWriter closes the current writer every time the partition key changes, whereas PartitionedFanoutWriter keeps writers in a cache and fetches them from the cache when switching partitions.

2.1.BaseTaskWriter

  BaseTaskWriter does not implement the write interface itself, but the second layer, BaseRollingWriter and its implementations, are all defined inside BaseTaskWriter.
  Because of its special requirements, Flink implements its own subclasses on top of BaseTaskWriter; a rough structural sketch of the nesting follows.
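  The sketch below is paraphrased and heavily simplified (not the actual source); the real classes also carry the file factories, target file size and rolling logic:

import java.io.Closeable;
import java.io.IOException;

import org.apache.iceberg.deletes.EqualityDeleteWriter;
import org.apache.iceberg.io.DataWriter;
import org.apache.iceberg.io.TaskWriter;

// Structural sketch only: the second write layer is defined as inner classes of the task writer.
abstract class BaseTaskWriterSketch<T> implements TaskWriter<T> {

  abstract class BaseRollingWriter<W extends Closeable> implements Closeable {
    @Override
    public void close() throws IOException {
      // close the current file writer and record the completed file
    }
  }

  class RollingFileWriter extends BaseRollingWriter<DataWriter<T>> {
    // produces DataFiles
  }

  class RollingEqDeleteWriter extends BaseRollingWriter<EqualityDeleteWriter<T>> {
    // produces equality DeleteFiles (currently only exercised by Flink)
  }
}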

2.2.UnpartitionedWriter

  It directly calls the write interface of BaseRollingWriter to perform the write:

public void write(T record) throws IOException {
  currentWriter.write(record);
}

  currentWriter is the lower-layer BaseRollingWriter; the RollingFileWriter variant is used, created without a partition key:

currentWriter = new RollingFileWriter(null);

2.3.PartitionedWriter

  The difference from UnpartitionedWriter is that records are partitioned first, and each partition key gets its own RollingFileWriter. In PartitionedWriter, switching to a new partition key closes the old RollingFileWriter:

public void write(T row) throws IOException {
  PartitionKey key = partition(row);

  if (!key.equals(currentKey)) {
    if (currentKey != null) {
      // if the key is null, there was no previous current key and current writer.
      currentWriter.close();
      completedPartitions.add(currentKey);
    }

    if (completedPartitions.contains(key)) {
      // if rows are not correctly grouped, detect and fail the write
      PartitionKey existingKey = Iterables.find(completedPartitions, key::equals, null);
      LOG.warn("Duplicate key: {} == {}", existingKey, key);
      throw new IllegalStateException("Already closed files for partition: " + key.toPath());
    }

    currentKey = key.copy();
    currentWriter = new RollingFileWriter(currentKey);
  }

  currentWriter.write(row);
}

2.3.1.Partitioning

  The partition function is implemented by subclasses; currently the only subclass of PartitionedWriter is the Spark one, i.e. only Spark uses this class:

protected PartitionKey partition(InternalRow row) {
  partitionKey.partition(internalRowWrapper.wrap(row));
  return partitionKey;
}

  InternalRowWrapper is used here for type conversion: it turns Spark's InternalRow into Iceberg's StructLike view; a sketch of the adapter idea follows.
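  A minimal sketch of that adapter idea, using a hypothetical class name (the real InternalRowWrapper in iceberg-spark is more complete and handles Spark's internal value representations):

import java.util.Arrays;

import org.apache.iceberg.StructLike;
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.StructType;

// Hypothetical adapter for illustration: expose a Spark InternalRow through Iceberg's
// StructLike interface so that PartitionKey.partition() can read column values from it.
class InternalRowStructLikeSketch implements StructLike {
  private final DataType[] types;
  private InternalRow row = null;

  InternalRowStructLikeSketch(StructType sparkType) {
    this.types = Arrays.stream(sparkType.fields())
        .map(field -> field.dataType())
        .toArray(DataType[]::new);
  }

  // The adapter is reused per task; wrap() just swaps the underlying row.
  InternalRowStructLikeSketch wrap(InternalRow newRow) {
    this.row = newRow;
    return this;
  }

  @Override
  public int size() {
    return types.length;
  }

  @Override
  public <T> T get(int pos, Class<T> javaClass) {
    return row.isNullAt(pos) ? null : javaClass.cast(row.get(pos, types[pos]));
  }

  @Override
  public <T> void set(int pos, T value) {
    throw new UnsupportedOperationException("read-only view over InternalRow");
  }
}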

2.3.2.PartitionKey (to be expanded)

  PartitionKey is both the concrete partition-assignment implementation and the partition key itself. Two PartitionKeys are considered equal when their internal partitionTuple arrays are equal; partitionTuple is simply an Object array whose length equals the number of partition columns:

public boolean equals(Object o) {
  if (this == o) {
    return true;
  } else if (!(o instanceof PartitionKey)) {
    return false;
  }

  PartitionKey that = (PartitionKey) o;
  return Arrays.equals(partitionTuple, that.partitionTuple);
}

  What the partition interface actually changes is this partitionTuple array (this part needs further study):

public void partition(StructLike row) {
  for (int i = 0; i < partitionTuple.length; i += 1) {
    Function<Object, Object> transform = transforms[i];
    partitionTuple[i] = transform.apply(accessors[i].get(row));
  }
}

2.4.PartitionedFanoutWriter

  The difference from PartitionedWriter is that the RollingFileWriter is not closed on every partition switch; instead the writers are kept in a cache and fetched from it when needed:

public void write(T row) throws IOException {
  PartitionKey partitionKey = partition(row);

  RollingFileWriter writer = writers.get(partitionKey);
  if (writer == null) {
    // NOTICE: we need to copy a new partition key here, in case of messing up the keys in
    // writers.
    PartitionKey copiedKey = partitionKey.copy();
    writer = new RollingFileWriter(copiedKey);
    writers.put(copiedKey, writer);
  }

  writer.write(row);
}

  Everything else is analogous: the partition function is implemented by subclasses, both Flink and Spark provide one, and both use PartitionKey. The difference is again in the data conversion; Flink uses RowDataWrapper, as sketched below.
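  A paraphrased sketch of the Flink-side partition function (the actual code lives in RowDataPartitionedFanoutWriter inside RowDataTaskWriterFactory); the pattern mirrors the Spark subclass, only the wrapper type changes:

// Paraphrased, not copied verbatim from RowDataTaskWriterFactory.
protected PartitionKey partition(RowData row) {
  // RowDataWrapper adapts Flink's RowData to Iceberg's StructLike, the same role
  // InternalRowWrapper plays for Spark's InternalRow.
  partitionKey.partition(rowDataWrapper.wrap(row));
  return partitionKey;
}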

3.TaskWriter Engine Subclasses

  Each engine implements its own TaskWriter subclasses according to its needs; the subclass pins down the record type being written: RowData for Flink and InternalRow for Spark, as the declarations below illustrate.
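  Paraphrased declarations (bodies omitted) showing the bound record types:

// Flink: the factory and its writers are parameterized with Flink's RowData.
public class RowDataTaskWriterFactory implements TaskWriterFactory<RowData> { /* ... */ }

// Spark: the partitioned writers extend the iceberg-core classes with Spark's InternalRow.
public class SparkPartitionedWriter extends PartitionedWriter<InternalRow> { /* ... */ }
public class SparkPartitionedFanoutWriter extends PartitionedFanoutWriter<InternalRow> { /* ... */ }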

3.1.Flink

  Flink has four TaskWriter subclasses. They are created by the RowDataTaskWriterFactory factory class, inside the open method of IcebergStreamWriter; creation branches four ways, one branch per implementation.
  Flink is special here because of how it implements updates: an update is issued as one delete message plus one insert message, so the writer implementations are special too, and there are dedicated DeleteFiles:

if (equalityFieldIds == null || equalityFieldIds.isEmpty()) {
  // Initialize a task writer to write INSERT only.
  if (spec.isUnpartitioned()) {
    return new UnpartitionedWriter<>(
        spec, format, appenderFactory, outputFileFactory, io, targetFileSizeBytes);
  } else {
    return new RowDataPartitionedFanoutWriter(
        spec,
        format,
        appenderFactory,
        outputFileFactory,
        io,
        targetFileSizeBytes,
        schema,
        flinkSchema);
  }
} else {
  // Initialize a task writer to write both INSERT and equality DELETE.
  if (spec.isUnpartitioned()) {
    return new UnpartitionedDeltaWriter(
        spec,
        format,
        appenderFactory,
        outputFileFactory,
        io,
        targetFileSizeBytes,
        schema,
        flinkSchema,
        equalityFieldIds,
        upsert);
  } else {
    return new PartitionedDeltaWriter(
        spec,
        format,
        appenderFactory,
        outputFileFactory,
        io,
        targetFileSizeBytes,
        schema,
        flinkSchema,
        equalityFieldIds,
        upsert);
  }
}

  So Flink has two outer branches and two inner branches. The inner branches differ only in whether the table is partitioned; the outer branches differ in the lower-level writer implementation.
  BaseRollingWriter has an implementation named RollingEqDeleteWriter that is currently only used by Flink. BaseTaskWriter also contains BaseEqualityDeltaWriter, which only Flink builds upon; for the layer below it uses both BaseRollingWriter implementations at the same time, with RollingEqDeleteWriter recording the DeleteFiles (a sketch of its insert path follows the constructor below):

protected BaseEqualityDeltaWriter(StructLike partition, Schema schema, Schema deleteSchema) {
  Preconditions.checkNotNull(schema, "Iceberg table schema cannot be null.");
  Preconditions.checkNotNull(deleteSchema, "Equality-delete schema cannot be null.");
  this.structProjection = StructProjection.create(schema, deleteSchema);

  this.dataWriter = new RollingFileWriter(partition);
  this.eqDeleteWriter = new RollingEqDeleteWriter(partition);
  this.posDeleteWriter =
      new SortedPosDeleteWriter<>(appenderFactory, fileFactory, format, partition);
  this.insertedRowMap = StructLikeMap.create(deleteSchema.asStruct());
}
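
  For context, a paraphrased (not verbatim) sketch of BaseEqualityDeltaWriter's insert path: the file position of every inserted row is remembered in insertedRowMap, and re-inserting the same equality key first emits a positional delete for the earlier copy before the new row reaches the data writer:

// Paraphrased sketch of BaseEqualityDeltaWriter.write(T); details simplified.
public void write(T row) throws IOException {
  // remember where this row will land: current data file path + row offset
  PathOffset pathOffset = PathOffset.of(dataWriter.currentPath(), dataWriter.currentRows());

  // project the row down to the equality-delete key and keep a copy of it
  StructLike copiedKey = StructCopy.copy(structProjection.wrap(asStructLike(row)));

  // if this key was already inserted by this task, position-delete the old copy
  PathOffset previous = insertedRowMap.put(copiedKey, pathOffset);
  if (previous != null) {
    posDeleteWriter.delete(previous.path, previous.rowOffset, null);
  }

  dataWriter.write(row);
}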

3.2.Spark

  The Spark implementation is comparatively simple: it maps directly onto the three base classes. The unpartitioned class is used as-is, and the two partitioned classes each get a Spark subclass whose main job is implementing the partition interface and plugging in the type-conversion class:

if (spec.isUnpartitioned()) {
  writer =
      new UnpartitionedWriter<>(
          spec, format, appenderFactory, fileFactory, table.io(), Long.MAX_VALUE);
} else if (PropertyUtil.propertyAsBoolean(
    properties,
    TableProperties.SPARK_WRITE_PARTITIONED_FANOUT_ENABLED,
    TableProperties.SPARK_WRITE_PARTITIONED_FANOUT_ENABLED_DEFAULT)) {
  writer =
      new SparkPartitionedFanoutWriter(
          spec,
          format,
          appenderFactory,
          fileFactory,
          table.io(),
          Long.MAX_VALUE,
          schema,
          structType);
} else {
  writer =
      new SparkPartitionedWriter(
          spec,
          format,
          appenderFactory,
          fileFactory,
          table.io(),
          Long.MAX_VALUE,
          schema,
          structType);
}

4.BaseRollingWriter

  The next write layer below TaskWriter is BaseRollingWriter. As mentioned above, it has two implementations: RollingFileWriter and RollingEqDeleteWriter. RollingFileWriter is the common one; Flink additionally uses RollingEqDeleteWriter.
  This layer does relatively little; write is defined directly in the parent class:

public void write(T record) throws IOException {
  write(currentWriter, record);
  this.currentRows++;

  if (shouldRollToNewFile()) {
    closeCurrent();
    openCurrent();
  }
}

  The core is currentWriter, created by newWriter, which is implemented in the subclasses and invoked from openCurrent:

private void openCurrent() {
  if (partitionKey == null) {
    // unpartitioned
    this.currentFile = fileFactory.newOutputFile();
  } else {
    // partitioned
    this.currentFile = fileFactory.newOutputFile(partitionKey);
  }
  this.currentWriter = newWriter(currentFile, partitionKey);
  this.currentRows = 0;
}

RollingFileWriter:

DataWriter<T> newWriter(EncryptedOutputFile file, StructLike partitionKey) {
  return appenderFactory.newDataWriter(file, format, partitionKey);
}

RollingEqDeleteWriter:

EqualityDeleteWriter<T> newWriter(EncryptedOutputFile file, StructLike partitionKey) {
  return appenderFactory.newEqDeleteWriter(file, format, partitionKey);
}

4.1.newDataWriter

  Iceberg, Flink and Spark each have their own implementation, and they differ little: each creates a DataWriter, and before the DataWriter is created a FileAppender is created as one of its members.
  The implementations differ in the interface's return type and in the core Function passed when the FileAppender is created.
  The return types are listed below, followed by a sketch of the shared body:

Iceberg (GenericAppenderFactory):
public org.apache.iceberg.io.DataWriter<Record> newDataWriter(...)

Flink (FlinkAppenderFactory):
public DataWriter<RowData> newDataWriter(...)

Spark (SparkAppenderFactory):
public DataWriter<InternalRow> newDataWriter(...)
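
  The body is roughly the same in all three; a paraphrased sketch based on the iceberg-core GenericAppenderFactory (details such as constructor arguments may differ slightly from the actual source): create the FileAppender for the target format and wrap it into a DataWriter together with the file location, spec and partition:

// Paraphrased sketch; parameter handling simplified.
public org.apache.iceberg.io.DataWriter<Record> newDataWriter(
    EncryptedOutputFile file, FileFormat format, StructLike partition) {
  return new org.apache.iceberg.io.DataWriter<>(
      newAppender(file.encryptingOutputFile(), format),  // the FileAppender member
      format,
      file.encryptingOutputFile().location(),
      spec,
      partition,
      file.keyMetadata());
}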

4.2.newEqDeleteWriter

  newEqDeleteWriter is likewise implemented by Iceberg, Flink and Spark, although, given how the upper layer is wired, only Flink should actually reach it.
  Overall it mirrors newDataWriter; the difference is the FileWriter type produced: newDataWriter yields a DataWriter, newEqDeleteWriter yields an EqualityDeleteWriter. The two signatures are compared below.
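  For reference, the two factory methods as they appear (signature level only) on the FileAppenderFactory interface that these appender factories implement:

// From the FileAppenderFactory<T> interface; bodies are provided by the engine-specific factories.
DataWriter<T> newDataWriter(EncryptedOutputFile outputFile, FileFormat format, StructLike partition);

EqualityDeleteWriter<T> newEqDeleteWriter(EncryptedOutputFile outputFile, FileFormat format, StructLike partition);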

5.FileAppender

5.1.Creation

  FileAppender is already a very low-level write class and already involves the concrete file formats. As described in the previous section, it is created inside the newDataWriter call and becomes a member of the FileWriter.
  Iceberg, Flink and Spark create it in much the same way, differing only in the concrete FileAppender configuration.
  In essence, a FileAppender is created for the matching file format. The Iceberg version:

switch (fileFormat) {
  case AVRO:
    return Avro.write(outputFile)
        .schema(schema)
        .createWriterFunc(DataWriter::create)
        .metricsConfig(metricsConfig)
        .setAll(config)
        .overwrite()
        .build();

  case PARQUET:
    return Parquet.write(outputFile)
        .schema(schema)
        .createWriterFunc(GenericParquetWriter::buildWriter)
        .setAll(config)
        .metricsConfig(metricsConfig)
        .overwrite()
        .build();

  case ORC:
    return ORC.write(outputFile)
        .schema(schema)
        .createWriterFunc(GenericOrcWriter::buildWriter)
        .setAll(config)
        .metricsConfig(metricsConfig)
        .overwrite()
        .build();

  default:
    throw new UnsupportedOperationException(
        "Cannot write unknown file format: " + fileFormat);
}

  The differences are mainly in the WriterFunc. Flink's Parquet version:

.createWriterFunc(msgType -> FlinkParquetWriters.buildWriter(flinkSchema, msgType))

  Spark's Parquet version:

.createWriterFunc(msgType -> SparkParquetWriters.buildWriter(dsSchema, msgType))

5.1.1.BloomFilter

  When the Parquet FileAppender is created, BloomFilter settings are applied per column:

for (Map.Entry<String, String> entry : columnBloomFilterEnabled.entrySet()) {
  String colPath = entry.getKey();
  String bloomEnabled = entry.getValue();
  propsBuilder.withBloomFilterEnabled(colPath, Boolean.valueOf(bloomEnabled));
}

5.2.Write Call Path

  The calls above FileAppender all chain through write interfaces; at FileAppender the call becomes add. The FileAppender is a member of the FileWriter, and DataWriter invokes it like this:

public void write(T row) {
  appender.add(row);
}

  As mentioned earlier, FileAppender has a different implementation per file format.

  Taking Parquet as an example, the implementation is:

public void add(T value) {
  recordCount += 1;
  model.write(0, value);
  writeStore.endRecord();
  checkSize();
}

  The actual writing is done by model, which comes from the WriterFunc passed in at creation time:

this.model = (ParquetValueWriter<T>) createWriterFunc.apply(parquetSchema);

5.3.WriterFunc

  The heart of data writing is the WriterFunc. Taking Flink as an example, it is built by FlinkParquetWriters:

public static <T> ParquetValueWriter<T> buildWriter(LogicalType schema, MessageType type) {
  return (ParquetValueWriter<T>)
      ParquetWithFlinkSchemaVisitor.visit(schema, type, new WriteBuilder(type));
}

  This uses the visitor pattern: the function ultimately handed back to the caller is decided by ParquetWithFlinkSchemaVisitor, which dispatches on the incoming type:

public static <T> T visit(
    LogicalType sType, Type type, ParquetWithFlinkSchemaVisitor<T> visitor) {
  Preconditions.checkArgument(sType != null, "Invalid DataType: null");
  if (type instanceof MessageType) {
    Preconditions.checkArgument(
        sType instanceof RowType, "Invalid struct: %s is not a struct", sType);
    RowType struct = (RowType) sType;
    return visitor.message(
        struct, (MessageType) type, visitFields(struct, type.asGroupType(), visitor));
  } else if (type.isPrimitive()) {
    return visitor.primitive(sType, type.asPrimitiveType());
  } else {
    // remaining branch (group types: struct, list, map) omitted
  }
}

  Focus on the primitive branch, which handles primitive types; the implementation ends up back in FlinkParquetWriters, where a different reader/writer is returned per data type. Taking strings as an example, these logical types all share one class:

case ENUM:
case JSON:
case UTF8:
  return strings(desc);

  The concrete class created is StringDataWriter; each data type has its own implementation class like this:

private static ParquetValueWriters.PrimitiveWriter<StringData> strings(ColumnDescriptor desc) {
  return new StringDataWriter(desc);
}

  When a string value is written, the write interface of StringDataWriter is invoked:

public void write(int repetitionLevel, StringData value) {
  column.writeBinary(repetitionLevel, Binary.fromReusedByteArray(value.toBytes()));
  stringDataFieldMetricsBuilder.addValue(value);
}

  column here is Iceberg's wrapper around the Parquet writer, ColumnWriter, which calls the binary write interface of the third-party Parquet library:

public void writeBinary(int rl, Binary value) {
  columnWriter.write(value, rl, maxDefinitionLevel);
}

6.Metadata Write

  Following the earlier analysis of the Flink flow, metadata writing has two phases: 1. writing the manifest files; 2. committing the metadata.

6.1.Manifest File Write

  Flink's call chain eventually reaches writeDataFiles in FlinkManifestUtil, which is where Iceberg's actual write classes take over; this is shared code:

static ManifestFile writeDataFiles(
    OutputFile outputFile, PartitionSpec spec, List<DataFile> dataFiles) throws IOException {
  ManifestWriter<DataFile> writer =
      ManifestFiles.write(FORMAT_V2, spec, outputFile, DUMMY_SNAPSHOT_ID);

  try (ManifestWriter<DataFile> closeableWriter = writer) {
    closeableWriter.addAll(dataFiles);
  }

  return writer.toManifestFile();
}

6.1.1.Creating the ManifestWriter

  The writer is created according to the format version passed in; Flink uses the V2 format directly:

public static ManifestWriter<DataFile> write(
    int formatVersion, PartitionSpec spec, OutputFile outputFile, Long snapshotId) {
  switch (formatVersion) {
    case 1:
      return new ManifestWriter.V1Writer(spec, outputFile, snapshotId);
    case 2:
      return new ManifestWriter.V2Writer(spec, outputFile, snapshotId);
  }
  throw new UnsupportedOperationException(
      "Cannot write manifest for table version: " + formatVersion);
}

6.1.2.Writing the Entries

  The addAll interface writes the entries and ultimately still lands on a FileAppender add call; the implementation class just differs from the data-file path: here it is ManifestWriter, and the call goes through its addEntry interface, as sketched below.
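  The bridge from add/addAll to addEntry is thin; paraphrased (not verbatim):

// Paraphrased, not verbatim: addAll() is the generic FileAppender loop, and ManifestWriter.add()
// wraps each DataFile into a reused ManifestEntry marked as ADDED before calling addEntry().
default void addAll(Iterable<D> values) {
  values.forEach(this::add);                             // FileAppender default method
}

public void add(F addedFile) {
  addEntry(reused.wrapAppend(snapshotId, addedFile));    // ManifestWriter
}

  addEntry then updates the counters and partition stats and forwards the prepared entry to the underlying appender: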

void addEntry(ManifestEntry<F> entry) {
  switch (entry.status()) {
    case ADDED:
      addedFiles += 1;
      addedRows += entry.file().recordCount();
      break;
    case EXISTING:
      existingFiles += 1;
      existingRows += entry.file().recordCount();
      break;
    case DELETED:
      deletedFiles += 1;
      deletedRows += entry.file().recordCount();
      break;
  }

  stats.update(entry.file().partition());

  if (entry.isLive()
      && entry.dataSequenceNumber() != null
      && (minDataSequenceNumber == null || entry.dataSequenceNumber() < minDataSequenceNumber)) {
    this.minDataSequenceNumber = entry.dataSequenceNumber();
  }

  writer.add(prepare(entry));
}

  The entry is finally written with writer, which is created as follows; the implementation shown is V2Writer (there are other implementations):

protected FileAppender<ManifestEntry<DataFile>> newAppender(
    PartitionSpec spec, OutputFile file) {
  Schema manifestSchema = V2Metadata.entrySchema(spec.partitionType());
  try {
    return Avro.write(file)
        .schema(manifestSchema)
        .named("manifest_entry")
        .meta("schema", SchemaParser.toJson(spec.schema()))
        .meta("partition-spec", PartitionSpecParser.toJsonFields(spec))
        .meta("partition-spec-id", String.valueOf(spec.specId()))
        .meta("format-version", "2")
        .meta("content", "data")
        .overwrite()
        .build();
  } catch (IOException e) {
    throw new RuntimeIOException(e, "Failed to create manifest writer for path: %s", file);
  }
}

  This is once more a FileAppender implementation, in this case AvroFileAppender, and data is again added through the add interface:

public void add(D datum) {
  try {
    numRecords += 1L;
    writer.append(datum);
  } catch (IOException e) {
    throw new RuntimeIOException(e);
  }
}

  The writer here is the DataFileWriter class from the third-party Avro library, created as follows:

private static <D> DataFileWriter<D> newAvroWriter(
    Schema schema,
    PositionOutputStream stream,
    DatumWriter<?> metricsAwareDatumWriter,
    CodecFactory codec,
    Map<String, String> metadata)
    throws IOException {
  DataFileWriter<D> writer = new DataFileWriter<>((DatumWriter<D>) metricsAwareDatumWriter);

  writer.setCodec(codec);

  for (Map.Entry<String, String> entry : metadata.entrySet()) {
    writer.setMeta(entry.getKey(), entry.getValue());
  }

  return writer.create(schema, stream);
}

  What gets written here should be the manifest file; its name is generated like this:

private String generatePath(long checkpointId) {
  return FileFormat.AVRO.addExtension(
      String.format(
          "%s-%s-%05d-%d-%d-%05d",
          flinkJobId,
          operatorUniqueId,
          subTaskId,
          attemptNumber,
          checkpointId,
          fileCount.incrementAndGet()));
}

6.2.Commit

  The commit interface ultimately lands in Iceberg's common class PendingUpdate, used here through its subclass SnapshotProducer.
  The commit interface has three parts: 1. configuration; 2. building the metadata object; 3. the commit itself.

6.2.1.Configuration

  What is configured here is how metadata commit conflicts are handled: retry counts, wait times, and so on:

Tasks.foreach(ops)
    .retry(base.propertyAsInt(COMMIT_NUM_RETRIES, COMMIT_NUM_RETRIES_DEFAULT))
    .exponentialBackoff(
        base.propertyAsInt(COMMIT_MIN_RETRY_WAIT_MS, COMMIT_MIN_RETRY_WAIT_MS_DEFAULT),
        base.propertyAsInt(COMMIT_MAX_RETRY_WAIT_MS, COMMIT_MAX_RETRY_WAIT_MS_DEFAULT),
        base.propertyAsInt(COMMIT_TOTAL_RETRY_TIME_MS, COMMIT_TOTAL_RETRY_TIME_MS_DEFAULT),
        2.0 /* exponential */)

6.2.2.Metadata Construction

  The core of metadata construction is building the TableMetadata object, which is based on a Snapshot:

Snapshot newSnapshot = apply();
newSnapshotId.set(newSnapshot.snapshotId());
TableMetadata.Builder update = TableMetadata.buildFrom(base);
if (base.snapshot(newSnapshot.snapshotId()) != null) {
  // this is a rollback operation
  update.setBranchSnapshot(newSnapshot.snapshotId(), targetBranch);
} else if (stageOnly) {
  update.addSnapshot(newSnapshot);
} else {
  update.setBranchSnapshot(newSnapshot, targetBranch);
}

TableMetadata updated = update.build();

  apply builds the new Snapshot; essentially it fetches the old Snapshot from the metadata and derives the new one from it according to the rules.
  Fetching the Snapshot goes through TableOperations (the later metadata commit also goes through it). For Hive this is HiveTableOperations, which uses Hive's IMetaStoreClient to read the metadata stored in Hive:

IMetaStoreClient msc = null;
try {
  Table table;
  if (!isSecurityMode) {
    table = metaClients.run(client -> client.getTable(catalogName, database, tableName));
  } else {
    msc = new HiveMetaStoreClient(conf);
    table = msc.getTable(catalogName, database, tableName);
  }
  // ... rest of the method (and the surrounding try/exception handling) omitted

  A new sequence number is then derived from the existing metadata:

long sequenceNumber = base.nextSequenceNumber();

  Next, the path of the new manifest list is built; what gets written here is the snap-* file, again through a FileAppender implementation:

protected OutputFile manifestListPath() {
  return ops.io()
      .newOutputFile(
          ops.metadataFileLocation(
              FileFormat.AVRO.addExtension(
                  String.format(
                      "snap-%d-%d-%s", snapshotId(), attempt.incrementAndGet(), commitUUID))));
}

6.2.3.commit

  Finally the commit itself is performed. It differs per catalog and is based on TableOperations; for Hive the call reaches doCommit of HiveTableOperations.
  The overall flow was already described in the earlier discussion of commit conflicts; here the metadata.json file is written and then registered in Hive.
  The file path is built as follows:

private String newTableMetadataFilePath(TableMetadata meta, int newVersion) {
  String codecName =
      meta.property(
          TableProperties.METADATA_COMPRESSION, TableProperties.METADATA_COMPRESSION_DEFAULT);
  String fileExtension = TableMetadataParser.getFileExtension(codecName);
  return metadataFileLocation(
      meta, String.format("%05d-%s%s", newVersion, UUID.randomUUID(), fileExtension));
}

  The Hadoop catalog's file name is simpler:

private Path metadataFilePath(int metadataVersion, TableMetadataParser.Codec codec) {
  return metadataPath("v" + metadataVersion + TableMetadataParser.getFileExtension(codec));
}