Iceberg Flink read path

This article walks through FlinkInputFormat.open in the Apache Iceberg / Apache Flink integration, tracing the read path from split handling through decryption to the Parquet read details. It focuses on how RowDataIterator and DataIterator are constructed and driven, and on how the iterators handle file scanning and column projection.
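
For context, the method below runs inside a job wired up through Iceberg's Flink source. A minimal batch-read sketch (the warehouse path is a placeholder; the builder API assumed here is the 0.11-era FlinkSource):

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.data.RowData;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.source.FlinkSource;

public class IcebergBatchRead {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder table location; TableLoader resolves the Iceberg table.
        TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/db/tbl");

        // FlinkSource plans CombinedScanTasks; each one reaches
        // FlinkInputFormat.open as a FlinkInputSplit, which is where this
        // walkthrough starts.
        DataStream<RowData> stream = FlinkSource.forRowData()
                .env(env)
                .tableLoader(tableLoader)
                .streaming(false)
                .build();

        stream.print();
        env.execute("iceberg-batch-read");
    }
}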


org.apache.iceberg.flink.source.FlinkInputFormat.open

@Override
public void open(FlinkInputSplit split) {
    // split.getTask().files(): the CombinedScanTask's Collection<FileScanTask>
    //   (SplitScanTask instances in this trace)
    // encryption: org.apache.iceberg.encryption.PlaintextEncryptionManager
    // tableSchema & context.project(): org.apache.iceberg.Schema
    // caseSensitive: false
    this.iterator = new RowDataIterator(
            split.getTask(), io, encryption, tableSchema, context.project(), context.nameMapping(),
            context.caseSensitive());
}
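
FlinkInputSplit is essentially a carrier for the planned work: paraphrased (not the verbatim class), it pairs Flink's split number with the CombinedScanTask produced during scan planning, which is why open() can pull the file scan tasks back out.

// Paraphrased sketch of FlinkInputSplit, not the verbatim class.
class FlinkInputSplit implements org.apache.flink.core.io.InputSplit {
    private final int splitNumber;
    private final CombinedScanTask task;

    FlinkInputSplit(int splitNumber, CombinedScanTask task) {
        this.splitNumber = splitNumber;
        this.task = task;
    }

    @Override
    public int getSplitNumber() {
        return splitNumber;
    }

    CombinedScanTask getTask() {
        return task;
    }
}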

org.apache.iceberg.flink.source.DataIterator.DataIterator

DataIterator(CombinedScanTask task, FileIO io, EncryptionManager encryption) {
    this.tasks = task.files().iterator();

    Map<String, ByteBuffer> keyMetadata = Maps.newHashMap();
    task.files().stream()
        .flatMap(fileScanTask -> Stream.concat(Stream.of(fileScanTask.file()), fileScanTask.deletes().stream()))
        .forEach(file -> keyMetadata.put(file.path().toString(), file.keyMetadata()));
    Stream<EncryptedInputFile> encrypted = keyMetadata.entrySet().stream()
        .map(entry -> EncryptedFiles.encryptedInput(io.newInputFile(entry.getKey()), entry.getValue()));

    // decrypt with the batch call to avoid multiple RPCs to a key server, if possible
    Iterable<InputFile> decryptedFiles = encryption.decrypt(encrypted::iterator);

    Map<String, InputFile> files = Maps.newHashMapWithExpectedSize(task.files().size());
    decryptedFiles.forEach(decrypted -> files.putIfAbsent(decrypted.location(), decrypted));

    // paths of all Parquet files this task needs to scan
    this.inputFiles = Collections.unmodifiableMap(files);

    this.currentIterator = CloseableIterator.empty();
}
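
In this trace the manager is PlaintextEncryptionManager, so the batch decrypt is effectively a pass-through; paraphrased (not verbatim source), the single-file decrypt just unwraps the input file:

// Paraphrased: no real cryptography happens here. Key metadata on a file
// would only produce a warning, since plaintext mode cannot use it.
public InputFile decrypt(EncryptedInputFile encrypted) {
    return encrypted.encryptedInputFile();
}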

org.apache.iceberg.hadoop.HadoopFileIO#newInputFile {...}
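
The body is elided above; paraphrased (not verbatim), HadoopFileIO resolves the path against its Hadoop Configuration:

// Paraphrased sketch of HadoopFileIO.newInputFile.
public InputFile newInputFile(String path) {
    return HadoopInputFile.fromLocation(path, hadoopConf.get());
}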

org.apache.iceberg.flink.source.RowDataIterator.RowDataIterator

RowDataIterator(CombinedScanTask task, FileIO io, EncryptionManager encryption, Schema tableSchema,
        Schema projectedSchema, String nameMapping, boolean caseSensitive) {
    super(task, io, encryption);
    // tableSchema & projectedSchema: org.apache.iceberg.Schema
    this.tableSchema = tableSchema;
    this.projectedSchema = projectedSchema;
    this.nameMapping = nameMapping;
    // caseSensitive: false in this trace
    this.caseSensitive = caseSensitive;
}
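
projectedSchema is the column subset the query actually needs. A quick standalone example of deriving such a projection with the Schema API:

import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

Schema tableSchema = new Schema(
        Types.NestedField.required(1, "id", Types.LongType.get()),
        Types.NestedField.optional(2, "data", Types.StringType.get()),
        Types.NestedField.optional(3, "ts", Types.TimestampType.withZone()));

// select() keeps only the named columns; the Parquet readers below use the
// projected schema to skip columns that were never requested.
Schema projectedSchema = tableSchema.select("id", "data");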

org.apache.iceberg.flink.source.DataIterator.hasNext

@Override
public boolean hasNext() {
    updateCurrentIterator();
    return currentIterator.hasNext();
}

org.apache.iceberg.flink.source.DataIterator.updateCurrentIterator

private void updateCurrentIterator() {
    try {
        while (!currentIterator.hasNext() && tasks.hasNext()) {
            currentIterator.close();
            currentIterator = openTaskIterator(tasks.next());
        }
    } catch (IOException e) {
        throw new UncheckedIOException(e);
    }
}
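
This is a classic flattening pattern: a cursor over per-file iterators that closes each exhausted iterator before opening the next, so at most one file is open per task at a time. A self-contained sketch of the same pattern, independent of Iceberg:

import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Minimal illustration of the concat-and-close pattern used above.
class ChainedIterator<T, C extends Iterator<T> & Closeable> implements Iterator<T> {
    private final Iterator<C> sources;
    private C current;

    ChainedIterator(Iterator<C> sources) {
        this.sources = sources;
    }

    @Override
    public boolean hasNext() {
        try {
            while ((current == null || !current.hasNext()) && sources.hasNext()) {
                if (current != null) {
                    current.close(); // release the exhausted source before opening the next
                }
                current = sources.next();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return current != null && current.hasNext();
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.next();
    }
}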

org.apache.iceberg.flink.source.RowDataIterator.openTaskIterator

org.apache.iceberg.flink.source.RowDataIterator#newIterable

org.apache.iceberg.flink.source.RowDataIterator#newParquetIterable
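
The three bodies above are elided. In substance they assemble a Parquet read for a single FileScanTask; a paraphrased sketch assuming the 0.11-era Parquet.ReadBuilder and FlinkParquetReaders APIs (not verbatim source):

// Paraphrased sketch: how a per-task Parquet iterable is typically built.
CloseableIterable<RowData> newParquetIterable(FileScanTask task, Schema schema,
                                              InputFile inputFile, boolean caseSensitive) {
    return Parquet.read(inputFile)
            .project(schema)                        // read only projected columns
            .split(task.start(), task.length())     // restrict to this task's byte range
            .filter(task.residual())                // push down the residual predicate
            .caseSensitive(caseSensitive)
            .createReaderFunc(fileSchema ->
                    FlinkParquetReaders.buildReader(schema, fileSchema)) // RowData value readers
            .build();
}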

org.apache.iceberg.parquet.ParquetReader.iterator

@Override
public CloseableIterator<T> iterator() {
    FileIterator<T> iter = new FileIterator<>(init());
    addCloseable(iter);
    return iter;
}
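
addCloseable registers the iterator with the reader so closing one releases the other's resources. CloseableIterator extends both Iterator and Closeable, so a standalone consumer can drive it with try-with-resources (process() is a placeholder):

try (CloseableIterator<RowData> rows = reader.iterator()) {
    while (rows.hasNext()) {
        process(rows.next()); // placeholder consumer
    }
}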

org.apache.iceberg.flink.source.DataIterator.next

@Override
public T next() {
    updateCurrentIterator();
    return currentIterator.next();
}

org.apache.iceberg.flink.source.DataIterator.updateCurrentIterator (same method as shown above; next() advances to the next non-empty task iterator the same way hasNext() does)

org.apache.iceberg.parquet.ParquetReader.FileIterator.next

@Override
public T next() {
    // on the first call both valuesRead and nextRowGroupStart are 0,
    // so advance() positions the reader at the first row group
    if (valuesRead >= nextRowGroupStart) {
        advance();
    }

    if (reuseContainers) {
        this.last = model.read(last);
    } else {
        this.last = model.read(null);
    }
    valuesRead += 1;

    return last;
}
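
Both counters start at 0, so the very first next() calls advance() to position the reader at the first row group; thereafter advance() fires only when a row group is exhausted. The reuseContainers branch also sets a contract for callers: with reuse enabled, the reader may return the same mutable object every time, so anything retained must be copied (copy() below is a hypothetical helper, not an Iceberg API):

List<T> retained = new ArrayList<>();
while (iter.hasNext()) {
    T row = iter.next();
    // With reuseContainers = true, `row` may be overwritten by the next
    // next() call; deep-copy anything that outlives this iteration.
    retained.add(copy(row)); // copy() is hypothetical
}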

org.apache.iceberg.parquet.ParquetValueReaders.StructReader.read

@Override
public final T read(T reuse) {
    I intermediate = newStructData(reuse);

    for (int i = 0; i < readers.length; i += 1) {
        set(intermediate, i, readers[i].read(get(intermediate, i)));
    }

    return buildStruct(intermediate);
}
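
StructReader is a template method: newStructData, get, set, and buildStruct are hooks the concrete reader binds to its row type. An illustrative (not verbatim) binding over Flink's GenericRowData:

// Illustrative only: positional hooks over Flink's GenericRowData.
// numFields is an assumed field holding the struct's arity.
protected GenericRowData newStructData(RowData reuse) {
    return reuse instanceof GenericRowData
            ? (GenericRowData) reuse         // reuse the container if possible
            : new GenericRowData(numFields);
}

protected Object get(GenericRowData row, int pos) {
    return row.getField(pos);
}

protected void set(GenericRowData row, int pos, Object value) {
    row.setField(pos, value);
}

protected RowData buildStruct(GenericRowData row) {
    return row; // GenericRowData is already a RowData
}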

org.apache.iceberg.flink.data.FlinkParquetReaders.StringReader.read

@Override
public StringData read(StringData ignored) {
    Binary binary = column.nextBinary();
    ByteBuffer buffer = binary.toByteBuffer();
    if (buffer.hasArray()) {
        return StringData.fromBytes(
                buffer.array(), buffer.arrayOffset() + buffer.position(), buffer.remaining());
    } else {
        return StringData.fromBytes(binary.getBytes());
    }
}
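
The branch separates heap-backed buffers, whose backing array can be wrapped without copying, from direct buffers, which require materializing a byte[]. A small standalone illustration:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

ByteBuffer heap = ByteBuffer.wrap("abc".getBytes(StandardCharsets.UTF_8));
System.out.println(heap.hasArray());   // true: zero-copy via array()/arrayOffset()

ByteBuffer direct = ByteBuffer.allocateDirect(3);
System.out.println(direct.hasArray()); // false: bytes must be copied out (binary.getBytes())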

 
