1.Data file write interface relationships
The class call chain when Iceberg writes data files is: TaskWriter -> BaseRollingWriter -> FileWriter -> FileAppender -> Iceberg's wrapper class for the file format -> the concrete file-format implementation.
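For orientation, a rough annotated view of one record's path through these layers (a sketch only, not literal source):
// Sketch of one record's path through the write layers:
//
//   taskWriter.write(row)            // TaskWriter: routes the row to the right (partitioned) writer
//     -> rollingWriter.write(row)    // BaseRollingWriter: counts rows, rolls to a new file on size
//       -> fileWriter.write(row)     // FileWriter (e.g. DataWriter): delegates to its FileAppender
//         -> appender.add(row)       // FileAppender: format-specific write (Parquet / Avro / ORC)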
2.TaskWriter
Based on the analysis of the Flink integration, Iceberg writes are driven by a TaskWriter:
public void processElement(StreamRecord<T> element) throws Exception {
writer.write(element.getValue());
}
TaskWriter has many implementation classes. Let's first look at the base implementations in iceberg-core; there are four generic ones: BaseTaskWriter, UnpartitionedWriter, PartitionedWriter, and PartitionedFanoutWriter.
BaseTaskWriter is the base implementation and is not used directly; the other three are its subclasses. In addition, Flink implements its own custom subclasses.
UnpartitionedWriter is the implementation for unpartitioned writes.
PartitionedWriter and PartitionedFanoutWriter are both implementations for partitioned writes; each partition key maps to a lower-level object that does the actual writing. The difference is that PartitionedWriter closes the underlying writer every time the partition changes, while PartitionedFanoutWriter keeps the writers in a cache and fetches them from the cache when switching partitions.
2.1.BaseTaskWriter
BaseTaskWriter does not implement the write interface itself, but the second layer, BaseRollingWriter and its implementations, is defined inside BaseTaskWriter.
Because of its particular requirements, Flink implements its own subclasses on top of BaseTaskWriter.
2.2.UnpartitionedWriter
It calls the write interface of BaseRollingWriter directly to perform the write:
public void write(T record) throws IOException {
currentWriter.write(record);
}
currentWriter is the lower-level BaseRollingWriter; the concrete class used is RollingFileWriter:
currentWriter = new RollingFileWriter(null);
2.3.PartitionedWriter
The difference from UnpartitionedWriter is that records are partitioned first, and each partition key has its own RollingFileWriter. In PartitionedWriter, switching to a different partition key closes the old RollingFileWriter:
public void write(T row) throws IOException {
PartitionKey key = partition(row);
if (!key.equals(currentKey)) {
if (currentKey != null) {
// if the key is null, there was no previous current key and current writer.
currentWriter.close();
completedPartitions.add(currentKey);
}
if (completedPartitions.contains(key)) {
// if rows are not correctly grouped, detect and fail the write
PartitionKey existingKey = Iterables.find(completedPartitions, key::equals, null);
LOG.warn("Duplicate key: {} == {}", existingKey, key);
throw new IllegalStateException("Already closed files for partition: " + key.toPath());
}
currentKey = key.copy();
currentWriter = new RollingFileWriter(currentKey);
}
currentWriter.write(row);
}
2.3.1.Partitioning
The partition function is implemented by subclasses. Currently only Spark provides a subclass of this writer, so only Spark actually uses it:
protected PartitionKey partition(InternalRow row) {
partitionKey.partition(internalRowWrapper.wrap(row));
return partitionKey;
}
InternalRowWrapper is used here for type conversion, turning Spark's InternalRow into Iceberg's StructLike.
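To illustrate what such a wrapper does, here is a minimal hypothetical StructLike adapter over a plain Object[] row (the class and simplifications are mine; the real InternalRowWrapper additionally converts Spark-specific types such as UTF8String):
import org.apache.iceberg.StructLike;

// Hypothetical adapter: exposes an engine row (here just an Object[]) as Iceberg's StructLike,
// the same role InternalRowWrapper (Spark) and RowDataWrapper (Flink) play.
class ArrayRowWrapper implements StructLike {
  private Object[] row;

  ArrayRowWrapper wrap(Object[] newRow) {
    this.row = newRow;   // the wrapper is reused; only the underlying row is swapped
    return this;
  }

  @Override
  public int size() {
    return row.length;
  }

  @Override
  public <T> T get(int pos, Class<T> javaClass) {
    // a real wrapper also converts engine types to Iceberg's Java types here
    return javaClass.cast(row[pos]);
  }

  @Override
  public <T> void set(int pos, T value) {
    row[pos] = value;
  }
}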
2.3.2.PartitionKey (to be completed)
PartitionKey is both the concrete class that performs partition assignment and the partition key itself. Two PartitionKey instances compare equal if and only if their internal partitionTuple arrays are equal; partitionTuple is simply an Object array whose length is the number of partition columns:
public boolean equals(Object o) {
if (this == o) {
return true;
} else if (!(o instanceof PartitionKey)) {
return false;
}
PartitionKey that = (PartitionKey) o;
return Arrays.equals(partitionTuple, that.partitionTuple);
}
What the partition interface actually modifies is this partitionTuple array (this part needs further study):
public void partition(StructLike row) {
for (int i = 0; i < partitionTuple.length; i += 1) {
Function<Object, Object> transform = transforms[i];
partitionTuple[i] = transform.apply(accessors[i].get(row));
}
}
2.4.PartitionedFanoutWriter
The difference from PartitionedWriter is that the RollingFileWriter is not closed on every partition switch; instead the writers are kept in a cache and fetched from it when needed:
public void write(T row) throws IOException {
PartitionKey partitionKey = partition(row);
RollingFileWriter writer = writers.get(partitionKey);
if (writer == null) {
// NOTICE: we need to copy a new partition key here, in case of messing up the keys in
// writers.
PartitionKey copiedKey = partitionKey.copy();
writer = new RollingFileWriter(copiedKey);
writers.put(copiedKey, writer);
}
writer.write(row);
}
Everything else is similar: partitioning is implemented by subclasses, both Flink and Spark provide one, and both use PartitionKey. The difference is in the data conversion: Flink uses RowDataWrapper.
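The Flink-side partition() is expected to mirror the Spark version shown above, only wrapping RowData through RowDataWrapper. A minimal sketch (not verbatim source):
// Sketch of the Flink partition() override, modeled on the Spark version above
@Override
protected PartitionKey partition(RowData row) {
  partitionKey.partition(rowDataWrapper.wrap(row));
  return partitionKey;
}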
3.Engine-specific TaskWriter subclasses
Each engine implements its own TaskWriter subclasses according to its needs; the subclass fixes the row type being handled: RowData for Flink and InternalRow for Spark.
3.1.Flink
Flink has four TaskWriter subclasses, created by the factory class RowDataTaskWriterFactory inside the open method of IcebergStreamWriter. Creation branches on two conditions, yielding four different implementations.
Flink is special here: because of how updates are implemented (an update is written as one delete message plus one insert message), the writer classes are also special, and there are dedicated delete files (DeleteFile):
if (equalityFieldIds == null || equalityFieldIds.isEmpty()) {
// Initialize a task writer to write INSERT only.
if (spec.isUnpartitioned()) {
return new UnpartitionedWriter<>(
spec, format, appenderFactory, outputFileFactory, io, targetFileSizeBytes);
} else {
return new RowDataPartitionedFanoutWriter(
spec,
format,
appenderFactory,
outputFileFactory,
io,
targetFileSizeBytes,
schema,
flinkSchema);
}
} else {
// Initialize a task writer to write both INSERT and equality DELETE.
if (spec.isUnpartitioned()) {
return new UnpartitionedDeltaWriter(
spec,
format,
appenderFactory,
outputFileFactory,
io,
targetFileSizeBytes,
schema,
flinkSchema,
equalityFieldIds,
upsert);
} else {
return new PartitionedDeltaWriter(
spec,
format,
appenderFactory,
outputFileFactory,
io,
targetFileSizeBytes,
schema,
flinkSchema,
equalityFieldIds,
upsert);
}
}
On the Flink side there are two outer branches with two inner branches each. The inner branches differ only in whether the table is partitioned; the outer branches differ in the writer implementation used at the next layer.
BaseRollingWriter has an implementation called RollingEqDeleteWriter, currently used only by Flink. BaseTaskWriter also contains BaseEqualityDeltaWriter, which only Flink subclasses; for the next write layer it uses both BaseRollingWriter implementations at the same time, with RollingEqDeleteWriter recording the delete files:
protected BaseEqualityDeltaWriter(StructLike partition, Schema schema, Schema deleteSchema) {
Preconditions.checkNotNull(schema, "Iceberg table schema cannot be null.");
Preconditions.checkNotNull(deleteSchema, "Equality-delete schema cannot be null.");
this.structProjection = StructProjection.create(schema, deleteSchema);
this.dataWriter = new RollingFileWriter(partition);
this.eqDeleteWriter = new RollingEqDeleteWriter(partition);
this.posDeleteWriter =
new SortedPosDeleteWriter<>(appenderFactory, fileFactory, format, partition);
this.insertedRowMap = StructLikeMap.create(deleteSchema.asStruct());
}
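To make the delete-plus-insert behaviour concrete, here is a sketch of how the Flink delta writer routes a changelog row (modeled on the Flink delta writers; names are hedged and the upsert handling is simplified):
// Sketch: an UPDATE arrives as UPDATE_BEFORE + UPDATE_AFTER, i.e. a delete plus an insert
public void write(RowData row) throws IOException {
  switch (row.getRowKind()) {
    case INSERT:
    case UPDATE_AFTER:
      deltaWriter.write(row);    // data file, via RollingFileWriter
      break;
    case UPDATE_BEFORE:
    case DELETE:
      deltaWriter.delete(row);   // equality-delete file, via RollingEqDeleteWriter
      break;
  }
}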
3.2.Spark
The Spark implementation is simpler and maps directly to the three base classes: the unpartitioned writer is used as-is, and the two partitioned writers each get a subclass. The subclasses mainly implement the partition method and plug in the type-conversion wrapper:
if (spec.isUnpartitioned()) {
writer =
new UnpartitionedWriter<>(
spec, format, appenderFactory, fileFactory, table.io(), Long.MAX_VALUE);
} else if (PropertyUtil.propertyAsBoolean(
properties,
TableProperties.SPARK_WRITE_PARTITIONED_FANOUT_ENABLED,
TableProperties.SPARK_WRITE_PARTITIONED_FANOUT_ENABLED_DEFAULT)) {
writer =
new SparkPartitionedFanoutWriter(
spec,
format,
appenderFactory,
fileFactory,
table.io(),
Long.MAX_VALUE,
schema,
structType);
} else {
writer =
new SparkPartitionedWriter(
spec,
format,
appenderFactory,
fileFactory,
table.io(),
Long.MAX_VALUE,
schema,
structType);
}
4.BaseRollingWriter
The next write layer below TaskWriter is BaseRollingWriter. As mentioned above, it has two implementations: RollingFileWriter and RollingEqDeleteWriter. RollingFileWriter is the common one; Flink additionally uses RollingEqDeleteWriter.
This layer does not do much; write is defined directly in the parent class:
public void write(T record) throws IOException {
write(currentWriter, record);
this.currentRows++;
if (shouldRollToNewFile()) {
closeCurrent();
openCurrent();
}
}
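The roll decision is driven by the configured target file size; the check looks roughly like this (a sketch, the exact row-count divisor is an internal detail):
// Sketch of the rolling condition: check the file length only every N rows to avoid
// querying the underlying writer's length on every single record
private boolean shouldRollToNewFile() {
  return currentRows % ROWS_DIVISOR == 0 && length(currentWriter) >= targetFileSize;
}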
The core is the definition of currentWriter, which is created by newWriter and implemented in the subclasses. It is invoked from openCurrent:
private void openCurrent() {
if (partitionKey == null) {
// unpartitioned
this.currentFile = fileFactory.newOutputFile();
} else {
// partitioned
this.currentFile = fileFactory.newOutputFile(partitionKey);
}
this.currentWriter = newWriter(currentFile, partitionKey);
this.currentRows = 0;
}
RollingFileWriter
DataWriter<T> newWriter(EncryptedOutputFile file, StructLike partitionKey) {
return appenderFactory.newDataWriter(file, format, partitionKey);
}
RollingEqDeleteWriter
EqualityDeleteWriter<T> newWriter(EncryptedOutputFile file, StructLike partitionKey) {
return appenderFactory.newEqDeleteWriter(file, format, partitionKey);
}
4.1.newDataWriter
Iceberg, Flink, and Spark each have their own implementation, and they differ little: a DataWriter is created, and before that the FileAppender that becomes a member of the DataWriter.
The differences between the three implementations are the return type of the interface and the core writer function used when the FileAppender is created.
The return types differ:
Iceberg's GenericAppenderFactory:
public org.apache.iceberg.io.DataWriter<Record> newDataWriter(
Flink's FlinkAppenderFactory:
public DataWriter<RowData> newDataWriter(
Spark's SparkAppenderFactory:
public DataWriter<InternalRow> newDataWriter(
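The body follows the same pattern everywhere: build the format-specific FileAppender, then wrap it in a DataWriter. The Iceberg GenericAppenderFactory version looks roughly like this (hedged; the exact constructor arguments vary between versions):
@Override
public org.apache.iceberg.io.DataWriter<Record> newDataWriter(
    EncryptedOutputFile file, FileFormat format, StructLike partition) {
  // the FileAppender created here becomes the member the DataWriter delegates to
  return new org.apache.iceberg.io.DataWriter<>(
      newAppender(file.encryptingOutputFile(), format),
      format,
      file.encryptingOutputFile().location(),
      spec,
      partition,
      file.keyMetadata());
}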
4.2.newEqDeleteWriter
newEqDeleteWriter also has separate Iceberg, Flink, and Spark implementations, but given how the upper layer is wired, only Flink actually uses it.
Overall it is analogous to newDataWriter; the difference is the FileWriter type produced: newDataWriter produces a DataWriter, while newEqDeleteWriter produces an EqualityDeleteWriter.
5.FileAppender
5.1.Creation
FileAppender is already a very low-level data-writing class and already involves the concrete file formats. As described in the previous section, it is created during the newDataWriter call and becomes a member of the FileWriter.
Iceberg, Flink, and Spark create it in much the same way; the difference lies in the configuration of the concrete FileAppender.
In essence, a FileAppender matching the file format is created. The Iceberg version looks like this:
switch (fileFormat) {
case AVRO:
return Avro.write(outputFile)
.schema(schema)
.createWriterFunc(DataWriter::create)
.metricsConfig(metricsConfig)
.setAll(config)
.overwrite()
.build();
case PARQUET:
return Parquet.write(outputFile)
.schema(schema)
.createWriterFunc(GenericParquetWriter::buildWriter)
.setAll(config)
.metricsConfig(metricsConfig)
.overwrite()
.build();
case ORC:
return ORC.write(outputFile)
.schema(schema)
.createWriterFunc(GenericOrcWriter::buildWriter)
.setAll(config)
.metricsConfig(metricsConfig)
.overwrite()
.build();
default:
throw new UnsupportedOperationException(
"Cannot write unknown file format: " + fileFormat);
}
The main difference is the writer function (WriterFunc). Flink's Parquet version:
.createWriterFunc(msgType -> FlinkParquetWriters.buildWriter(flinkSchema, msgType))
Spark's Parquet version:
.createWriterFunc(msgType -> SparkParquetWriters.buildWriter(dsSchema, msgType))
5.1.1.BloomFilter
When the Parquet FileAppender is created, bloom filters can also be configured:
for (Map.Entry<String, String> entry : columnBloomFilterEnabled.entrySet()) {
String colPath = entry.getKey();
String bloomEnabled = entry.getValue();
propsBuilder.withBloomFilterEnabled(colPath, Boolean.valueOf(bloomEnabled));
}
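The columns come from table properties. Assuming a Table handle, a bloom filter can be enabled per column through the write.parquet.bloom-filter-enabled.column.<column> property prefix (property name per Iceberg's TableProperties; verify against the version in use):
// Enable a Parquet bloom filter for the (hypothetical) column "user_id"
table.updateProperties()
    .set("write.parquet.bloom-filter-enabled.column.user_id", "true")
    .commit();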
5.2.Write call path
All calls above FileAppender go through the write interfaces in sequence; at the FileAppender level the call changes to the add interface. FileAppender is a member of the FileWriter; DataWriter calls it as follows:
public void write(T row) {
appender.add(row);
}
As mentioned above, FileAppender has a different implementation for each file format.
Taking Parquet as an example, the implementation is:
public void add(T value) {
recordCount += 1;
model.write(0, value);
writeStore.endRecord();
checkSize();
}
The actual writing is done by model, which comes from the WriterFunc passed in at creation time:
this.model = (ParquetValueWriter<T>) createWriterFunc.apply(parquetSchema);
5.3.WriterFunc
The core of writing data is the WriterFunc. Taking Flink as an example, it is built by FlinkParquetWriters:
public static <T> ParquetValueWriter<T> buildWriter(LogicalType schema, MessageType type) {
return (ParquetValueWriter<T>)
ParquetWithFlinkSchemaVisitor.visit(schema, type, new WriteBuilder(type));
}
The visitor pattern is used here: the function that is eventually returned to the caller is determined by ParquetWithFlinkSchemaVisitor, with different implementations depending on the incoming types:
public static <T> T visit(
LogicalType sType, Type type, ParquetWithFlinkSchemaVisitor<T> visitor) {
Preconditions.checkArgument(sType != null, "Invalid DataType: null");
if (type instanceof MessageType) {
Preconditions.checkArgument(
sType instanceof RowType, "Invalid struct: %s is not a struct", sType);
RowType struct = (RowType) sType;
return visitor.message(
struct, (MessageType) type, visitFields(struct, type.asGroupType(), visitor));
} else if (type.isPrimitive()) {
return visitor.primitive(sType, type.asPrimitiveType());
  } else {
    // the remaining branch handles group types (struct, list, map); omitted here
  }
}
The branch to focus on is primitive, which handles basic types. The implementation is ultimately in FlinkParquetWriters: depending on the data type, a different reader/writer is returned. Taking strings as an example, these types all share one class:
case ENUM:
case JSON:
case UTF8:
return strings(desc);
The concrete class created is StringDataWriter; there is a different implementation class for each data type:
private static ParquetValueWriters.PrimitiveWriter<StringData> strings(ColumnDescriptor desc) {
  return new StringDataWriter(desc);
}
When a string value is written, the write interface of StringDataWriter is called:
public void write(int repetitionLevel, StringData value) {
column.writeBinary(repetitionLevel, Binary.fromReusedByteArray(value.toBytes()));
stringDataFieldMetricsBuilder.addValue(value);
}
The column here is ColumnWriter, Iceberg's wrapper around the Parquet writer; it calls into the third-party Parquet library, using its binary write interface:
public void writeBinary(int rl, Binary value) {
columnWriter.write(value, rl, maxDefinitionLevel);
}
6.Metadata writes
Following the earlier Flink flow analysis, metadata is written in two stages: 1. writing the manifest files; 2. committing the metadata.
6.1.Manifest file write
Flink's call chain eventually reaches writeDataFiles in FlinkManifestUtil, which is where the actual Iceberg write classes are invoked; this is shared code:
static ManifestFile writeDataFiles(
OutputFile outputFile, PartitionSpec spec, List<DataFile> dataFiles) throws IOException {
ManifestWriter<DataFile> writer =
ManifestFiles.write(FORMAT_V2, spec, outputFile, DUMMY_SNAPSHOT_ID);
try (ManifestWriter<DataFile> closeableWriter = writer) {
closeableWriter.addAll(dataFiles);
}
return writer.toManifestFile();
}
6.1.1.Creating the ManifestWriter
The writer is created according to the format version passed in; Flink uses the V2 format directly here:
public static ManifestWriter<DataFile> write(
int formatVersion, PartitionSpec spec, OutputFile outputFile, Long snapshotId) {
switch (formatVersion) {
case 1:
return new ManifestWriter.V1Writer(spec, outputFile, snapshotId);
case 2:
return new ManifestWriter.V2Writer(spec, outputFile, snapshotId);
}
throw new UnsupportedOperationException(
"Cannot write manifest for table version: " + formatVersion);
}
6.1.2.Writing the data
The addAll interface writes the data and ultimately also reaches the add interface of a FileAppender, but the implementation class differs from the data-file path above: here it is ManifestWriter, and the call goes through its addEntry interface:
void addEntry(ManifestEntry<F> entry) {
switch (entry.status()) {
case ADDED:
addedFiles += 1;
addedRows += entry.file().recordCount();
break;
case EXISTING:
existingFiles += 1;
existingRows += entry.file().recordCount();
break;
case DELETED:
deletedFiles += 1;
deletedRows += entry.file().recordCount();
break;
}
stats.update(entry.file().partition());
if (entry.isLive()
&& entry.dataSequenceNumber() != null
&& (minDataSequenceNumber == null || entry.dataSequenceNumber() < minDataSequenceNumber)) {
this.minDataSequenceNumber = entry.dataSequenceNumber();
}
writer.add(prepare(entry));
}
The actual write is done through writer, created as follows; the implementation class here is V2Writer (there are other implementations as well):
protected FileAppender<ManifestEntry<DataFile>> newAppender(
PartitionSpec spec, OutputFile file) {
Schema manifestSchema = V2Metadata.entrySchema(spec.partitionType());
try {
return Avro.write(file)
.schema(manifestSchema)
.named("manifest_entry")
.meta("schema", SchemaParser.toJson(spec.schema()))
.meta("partition-spec", PartitionSpecParser.toJsonFields(spec))
.meta("partition-spec-id", String.valueOf(spec.specId()))
.meta("format-version", "2")
.meta("content", "data")
.overwrite()
.build();
} catch (IOException e) {
throw new RuntimeIOException(e, "Failed to create manifest writer for path: %s", file);
}
}
In the end this is again a FileAppender implementation, in this case AvroFileAppender, and data is added through its add interface:
public void add(D datum) {
try {
numRecords += 1L;
writer.append(datum);
} catch (IOException e) {
throw new RuntimeIOException(e);
}
}
The writer here is DataFileWriter from the third-party Avro library, created as follows:
private static <D> DataFileWriter<D> newAvroWriter(
Schema schema,
PositionOutputStream stream,
DatumWriter<?> metricsAwareDatumWriter,
CodecFactory codec,
Map<String, String> metadata)
throws IOException {
DataFileWriter<D> writer = new DataFileWriter<>((DatumWriter<D>) metricsAwareDatumWriter);
writer.setCodec(codec);
for (Map.Entry<String, String> entry : metadata.entrySet()) {
writer.setMeta(entry.getKey(), entry.getValue());
}
return writer.create(schema, stream);
}
What gets written here should be the manifest file; its path is generated as follows:
private String generatePath(long checkpointId) {
return FileFormat.AVRO.addExtension(
String.format(
"%s-%s-%05d-%d-%d-%05d",
flinkJobId,
operatorUniqueId,
subTaskId,
attemptNumber,
checkpointId,
fileCount.incrementAndGet()));
}
6.2.Commit
The commit interface ultimately reaches Iceberg's common class PendingUpdate; here its subclass SnapshotProducer is used.
The commit interface has three parts: 1. configuration; 2. building the metadata object; 3. the commit itself.
6.2.1.Configuration
This configures how metadata commit conflicts are handled: the number of retries, the wait times, and so on:
Tasks.foreach(ops)
.retry(base.propertyAsInt(COMMIT_NUM_RETRIES, COMMIT_NUM_RETRIES_DEFAULT))
.exponentialBackoff(
base.propertyAsInt(COMMIT_MIN_RETRY_WAIT_MS, COMMIT_MIN_RETRY_WAIT_MS_DEFAULT),
base.propertyAsInt(COMMIT_MAX_RETRY_WAIT_MS, COMMIT_MAX_RETRY_WAIT_MS_DEFAULT),
base.propertyAsInt(COMMIT_TOTAL_RETRY_TIME_MS, COMMIT_TOTAL_RETRY_TIME_MS_DEFAULT),
2.0 /* exponential */)
6.2.2.Metadata construction
The core of metadata construction is building the TableMetadata object, which is done on top of a Snapshot:
Snapshot newSnapshot = apply();
newSnapshotId.set(newSnapshot.snapshotId());
TableMetadata.Builder update = TableMetadata.buildFrom(base);
if (base.snapshot(newSnapshot.snapshotId()) != null) {
// this is a rollback operation
update.setBranchSnapshot(newSnapshot.snapshotId(), targetBranch);
} else if (stageOnly) {
update.addSnapshot(newSnapshot);
} else {
update.setBranchSnapshot(newSnapshot, targetBranch);
}
TableMetadata updated = update.build();
Here apply builds the new Snapshot; the basis is to first fetch the old snapshot from the metadata and derive a new one according to the rules.
Fetching the snapshot is done through TableOperations (the later metadata commit also goes through it). For Hive this is HiveTableOperations, which uses Hive's IMetaStoreClient to read the metadata stored in Hive:
IMetaStoreClient msc = null;
try {
Table table;
if (!isSecurityMode) {
table = metaClients.run(client -> client.getTable(catalogName, database, tableName));
} else {
msc = new HiveMetaStoreClient(conf);
table = msc.getTable(catalogName, database, tableName);
}
A new sequence number is then derived from the previous metadata:
long sequenceNumber = base.nextSequenceNumber();
Next the new manifest list path is built; what is written here is the snap-* file, again through a FileAppender subclass:
protected OutputFile manifestListPath() {
return ops.io()
.newOutputFile(
ops.metadataFileLocation(
FileFormat.AVRO.addExtension(
String.format(
"snap-%d-%d-%s", snapshotId(), attempt.incrementAndGet(), commitUUID))));
}
6.2.3.commit
Finally the actual commit is performed. It is based on TableOperations and differs per catalog; for Hive it ends up in doCommit of HiveTableOperations.
The overall process was already described earlier when discussing commit conflicts; here the metadata.json file is written and then committed to Hive.
The file path is constructed as follows:
private String newTableMetadataFilePath(TableMetadata meta, int newVersion) {
String codecName =
meta.property(
TableProperties.METADATA_COMPRESSION, TableProperties.METADATA_COMPRESSION_DEFAULT);
String fileExtension = TableMetadataParser.getFileExtension(codecName);
return metadataFileLocation(
meta, String.format("%05d-%s%s", newVersion, UUID.randomUUID(), fileExtension));
}
The Hadoop catalog's file name is simpler:
private Path metadataFilePath(int metadataVersion, TableMetadataParser.Codec codec) {
return metadataPath("v" + metadataVersion + TableMetadataParser.getFileExtension(codec));
}