Iceberg Flink FLIP-27 Implementation

This article analyzes Flink's Source interface in detail, in particular how Iceberg implements getBoundedness, createEnumerator, and createReader, as well as split assignment, the monitor interval, and data reading in streaming and batch processing.

1 Flink Basic Interfaces

  Flink's base interface is Source, and its core consists of two methods: createEnumerator and createReader.
  createEnumerator is responsible for data discovery and creating the split enumerator; createReader is responsible for creating the actual reader.

public interface Source<T, SplitT extends SourceSplit, EnumChkT> extends Serializable {
    Boundedness getBoundedness();

    SourceReader<T, SplitT> createReader(SourceReaderContext readerContext) throws Exception;

    SplitEnumerator<SplitT, EnumChkT> createEnumerator(SplitEnumeratorContext<SplitT> enumContext) throws Exception;

    SplitEnumerator<SplitT, EnumChkT> restoreEnumerator(SplitEnumeratorContext<SplitT> enumContext, EnumChkT checkpoint) throws Exception;

    SimpleVersionedSerializer<SplitT> getSplitSerializer();

    SimpleVersionedSerializer<EnumChkT> getEnumeratorCheckpointSerializer();
}

  createEnumerator is invoked by Flink's SourceCoordinator, an independent component in Flink (it is actually created while the JobGraph is converted into the ExecutionGraph).

  When the SourceCoordinator starts, it calls createEnumerator:

if (enumerator == null) {
    final ClassLoader userCodeClassLoader =
            context.getCoordinatorContext().getUserCodeClassloader();
    try (TemporaryClassLoaderContext ignored =
            TemporaryClassLoaderContext.of(userCodeClassLoader)) {
        enumerator = source.createEnumerator(context);
    } catch (Throwable t) {
        ExceptionUtils.rethrowIfFatalErrorOrOOM(t);
        LOG.error("Failed to create Source Enumerator for source {}", operatorName, t);
        context.failJob(t);
        return;
    }
}

2 Iceberg's Implementation of the Core Interfaces

2.1 getBoundedness

  This method indicates whether the data source is bounded or unbounded, which corresponds to batch or streaming processing.

public Boundedness getBoundedness() {
  return scanContext.isStreaming() ? Boundedness.CONTINUOUS_UNBOUNDED : Boundedness.BOUNDED;
}

  Bounded or unbounded is chosen based on the scan configuration; in Iceberg it can be set as follows:

SELECT * FROM hjfdb.test2 /*+ OPTIONS('streaming'='true', 'monitor-interval'='10s')*/ ;

  The full configuration key is connector.iceberg.streaming:

public static final String STREAMING = "streaming";
public static final ConfigOption<Boolean> STREAMING_OPTION =
    ConfigOptions.key(PREFIX + STREAMING).booleanType().defaultValue(false);
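
  For DataStream jobs the same options can be set on the IcebergSource builder. Below is a minimal sketch, assuming the builder's streaming(...) and monitorInterval(...) setters (present in recent Iceberg Flink releases) and a hypothetical Hadoop table path; verify the exact API against your Iceberg and Flink versions:

import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.data.RowData;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.source.IcebergSource;

public class IcebergStreamingReadSketch {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Hypothetical table location; use your own catalog/table loader.
    TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/hjfdb/test2");

    IcebergSource<RowData> source =
        IcebergSource.forRowData()
            .tableLoader(tableLoader)
            .streaming(true)                         // same effect as 'streaming'='true'
            .monitorInterval(Duration.ofSeconds(10)) // same effect as 'monitor-interval'='10s'
            .build();

    DataStream<RowData> stream =
        env.fromSource(
            source,
            WatermarkStrategy.noWatermarks(),
            "iceberg-source",
            TypeInformation.of(RowData.class));
    stream.print();
    env.execute("iceberg-flip27-read-sketch");
  }
}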

2.2 createEnumerator

  createEnumerator and restoreEnumerator both delegate to the same internal method in Iceberg's implementation:

@Override
public SplitEnumerator<IcebergSourceSplit, IcebergEnumeratorState> createEnumerator(
    SplitEnumeratorContext<IcebergSourceSplit> enumContext) {
  return createEnumerator(enumContext, null);
}

@Override
public SplitEnumerator<IcebergSourceSplit, IcebergEnumeratorState> restoreEnumerator(
    SplitEnumeratorContext<IcebergSourceSplit> enumContext, IcebergEnumeratorState enumState) {
  return createEnumerator(enumContext, enumState);
}

2.2.1 Split Assigner

  Inside createEnumerator, a split assigner is created first:

SplitAssigner assigner;
if (enumState == null) {
  assigner = assignerFactory.createAssigner();
} else {
  LOG.info(
      "Iceberg source restored {} splits from state for table {}",
      enumState.pendingSplits().size(),
      lazyTable().name());
  assigner = assignerFactory.createAssigner(enumState.pendingSplits());
}

  Currently the assigner has only one implementation, SimpleSplitAssigner, whose core is an ArrayDeque used to store and hand out splits.

public SimpleSplitAssigner() {
  this.pendingSplits = new ArrayDeque<>();
}

public SimpleSplitAssigner(Collection<IcebergSourceSplitState> assignerState) {
  this.pendingSplits = new ArrayDeque<>(assignerState.size());
  // Because simple assigner only tracks unassigned splits,
  // there is no need to filter splits based on status (unassigned) here.
  assignerState.forEach(splitState -> pendingSplits.add(splitState.split()));
}
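
  To illustrate the idea, here is a self-contained sketch of a deque-backed assigner (not the Iceberg source; the method names only loosely follow the SplitAssigner interface): splits are queued on discovery and handed out one at a time.

import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Optional;
import java.util.Queue;

// Simplified sketch of a deque-backed split assigner (illustrative only).
public class DequeSplitAssignerSketch<SplitT> {
  private final Queue<SplitT> pendingSplits = new ArrayDeque<>();

  // Called by the enumerator when new splits are discovered.
  public synchronized void onDiscoveredSplits(Collection<SplitT> splits) {
    pendingSplits.addAll(splits);
  }

  // Called when a reader asks for work; an empty Optional means no split is available yet.
  public synchronized Optional<SplitT> getNext() {
    return Optional.ofNullable(pendingSplits.poll());
  }

  public synchronized int pendingSplitCount() {
    return pendingSplits.size();
  }
}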

2.2.2 SplitEnumerator

  After the split assigner is set up, the SplitEnumerator is created; there are streaming and batch variants:

if (scanContext.isStreaming()) {
  ContinuousSplitPlanner splitPlanner =
      new ContinuousSplitPlannerImpl(tableLoader.clone(), scanContext, planningThreadName());
  return new ContinuousIcebergEnumerator(
      enumContext, assigner, scanContext, splitPlanner, enumState);
} else {
  List<IcebergSourceSplit> splits = planSplitsForBatch(planningThreadName());
  assigner.onDiscoveredSplits(splits);
  return new StaticIcebergEnumerator(enumContext, assigner);
}
  • ContinuousSplitPlanner
    Streaming needs a plan for continuously scanning the table; that is the ContinuousSplitPlanner, whose implementation class is ContinuousSplitPlannerImpl.
    When ContinuousSplitPlannerImpl is constructed, a worker thread pool is set up:
this.isSharedPool = threadName == null;
this.workerPool =
    isSharedPool
        ? ThreadPools.getWorkerPool()
        : ThreadPools.newWorkerPool(
            "iceberg-plan-worker-pool-" + threadName, scanContext.planParallelism());

  By default threadName is non-null here, so a dedicated thread pool is used; its parallelism is set by table.exec.iceberg.worker-pool-size, and the default value is derived from the number of available CPU cores on the machine (a short configuration sketch is given after this list):

public static final int WORKER_THREAD_POOL_SIZE =
    getPoolSize(
        WORKER_THREAD_POOL_SIZE_PROP, Math.max(2, Runtime.getRuntime().availableProcessors()));
  • ContinuousIcebergEnumerator
    A core aspect of ContinuousIcebergEnumerator is that its start method registers a periodic asynchronous callback:
public void start() {
  super.start();
  enumeratorContext.callAsync(
      this::discoverSplits,
      this::processDiscoveredSplits,
      0L,
      scanContext.monitorInterval().toMillis());
}

  The interval is configured in the same way as the streaming flag:

SELECT * FROM hjfdb.test2 /*+ OPTIONS('streaming'='true', 'monitor-interval'='10s')*/ ;

  The full configuration key is connector.iceberg.monitor-interval:

public static final String MONITOR_INTERVAL = "monitor-interval";
public static final ConfigOption<String> MONITOR_INTERVAL_OPTION =
    ConfigOptions.key(PREFIX + MONITOR_INTERVAL).stringType().defaultValue("60s");

  start is also invoked from the SourceCoordinator; right after the enumerator is created, start is called:

runInEventLoop(() -> enumerator.start(), "starting the SplitEnumerator.");
  • planSplitsForBatch
    In batch execution the table is scanned directly to obtain the split list (the scan itself is covered in a later section).
    Once the splits are determined, they are handed straight to the split assigner:
assigner.onDiscoveredSplits(splits);

  Finally, a batch enumerator is created:

return new StaticIcebergEnumerator(enumContext, assigner);
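
  As noted in the ContinuousSplitPlanner item above, the planning parallelism comes from table.exec.iceberg.worker-pool-size. A minimal configuration sketch using the Flink Table API, assuming that key as quoted earlier:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class WorkerPoolSizeConfigSketch {
  public static void main(String[] args) {
    TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());
    // Raise the Iceberg split-planning worker pool to 8 threads
    // (the default derives from the available CPU cores, as shown above).
    tEnv.getConfig().getConfiguration().setInteger("table.exec.iceberg.worker-pool-size", 8);
  }
}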

2.3 createReader

  createReader is comparatively simple; it directly creates an IcebergSourceReader:

public SourceReader<T, IcebergSourceSplit> createReader(SourceReaderContext readerContext) {
  IcebergSourceReaderMetrics metrics =
      new IcebergSourceReaderMetrics(readerContext.metricGroup(), lazyTable().name());
  return new IcebergSourceReader<>(metrics, readerFunction, readerContext);
}

3 ContinuousIcebergEnumerator

  As mentioned above, ContinuousIcebergEnumerator's start method is:

public void start() {
  super.start();
  enumeratorContext.callAsync(
      this::discoverSplits,
      this::processDiscoveredSplits,
      0L,
      scanContext.monitorInterval().toMillis());
}

  This is a periodically scheduled asynchronous call whose core pieces are discoverSplits and processDiscoveredSplits; the result of discoverSplits is handed to processDiscoveredSplits.

3.1 discoverSplits

  This operation performs the scan that produces the splits. There is a threshold that pauses discovery when too many splits are pending, explained by the comment below:

// if ScanContext#maxPlanningSnapshotCount() is 10, each split enumeration can
// discovery splits up to 10 snapshots. if maxHistorySize is 3, the max number of
// splits tracked in assigner shouldn't be more than 10 * (3 + 1) snapshots
// worth of splits. +1 because there could be another enumeration when the
// pending splits fall just below the 10 * 3.
int totalSplitCountFromRecentDiscovery = Arrays.stream(history).reduce(0, Integer::sum);
return pendingSplitCountFromAssigner >= totalSplitCountFromRecentDiscovery;
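
  To make the comment above concrete: with maxPlanningSnapshotCount = 10 and maxHistorySize = 3, the assigner should never track much more than 10 * (3 + 1) snapshots' worth of splits. A small self-contained sketch of the pause check (illustrative, not the Iceberg source):

import java.util.Arrays;

// Illustrative sketch of the discovery pause check quoted above.
public class DiscoveryThrottleSketch {
  // history holds the split counts of the most recent discovery cycles (maxHistorySize entries).
  static boolean shouldPauseDiscovery(int pendingSplitsInAssigner, int[] history) {
    int totalFromRecentDiscovery = Arrays.stream(history).sum();
    return pendingSplitsInAssigner >= totalFromRecentDiscovery;
  }

  public static void main(String[] args) {
    int[] history = {40, 35, 50}; // the last three cycles discovered 125 splits in total
    System.out.println(shouldPauseDiscovery(80, history));  // false: keep discovering
    System.out.println(shouldPauseDiscovery(130, history)); // true: pause until readers catch up
  }
}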

  If the threshold has not been reached, splits are discovered via splitPlanner.planSplits, where lastPosition corresponds to a snapshot:

public ContinuousEnumerationResult planSplits(IcebergEnumeratorPosition lastPosition) {
  table.refresh();
  if (lastPosition != null) {
    return discoverIncrementalSplits(lastPosition);
  } else {
    return discoverInitialSplits();
  }
}

3.1.1 discoverInitialSplits

  This performs the initial scan. There are several scan modes, enumerated in StreamingStartingStrategy; only TABLE_SCAN_THEN_INCREMENTAL performs a regular full table scan first and then switches to incremental scanning (the strategies are sketched below).
  So a check is made here: if the strategy is TABLE_SCAN_THEN_INCREMENTAL, one batch scan is run and its splits are returned; otherwise the split list is empty. In either case the position is set, so later cycles no longer go through this method.
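
  For reference, the starting strategies in StreamingStartingStrategy look roughly like the sketch below (names taken from recent Iceberg releases; treat the exact set and semantics as version-dependent):

// Sketch of StreamingStartingStrategy; only TABLE_SCAN_THEN_INCREMENTAL does a
// full table scan first and then continues with incremental discovery.
public enum StreamingStartingStrategy {
  TABLE_SCAN_THEN_INCREMENTAL,        // full scan of the current snapshot, then incremental
  INCREMENTAL_FROM_LATEST_SNAPSHOT,   // incremental reads starting from the latest snapshot
  INCREMENTAL_FROM_EARLIEST_SNAPSHOT, // incremental reads starting from the earliest snapshot
  INCREMENTAL_FROM_SNAPSHOT_ID,       // start from a user-specified snapshot id
  INCREMENTAL_FROM_SNAPSHOT_TIMESTAMP // start from the snapshot at/after a given timestamp
}
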
  The initial full-table scan method is shown below; this is also the batch-scan entry point:

public static List<IcebergSourceSplit> planIcebergSourceSplits(
    Table table, ScanContext context, ExecutorService workerPool) {
  try (CloseableIterable<CombinedScanTask> tasksIterable =
      planTasks(table, context, workerPool)) {
    return Lists.newArrayList(
        CloseableIterable.transform(tasksIterable, IcebergSourceSplit::fromCombinedScanTask));
  } catch (IOException e) {
    throw new UncheckedIOException("Failed to process task iterable: ", e);
  }
}

  The real scanning happens in planTasks, which chooses between incremental and non-incremental mode as shown below. For the initial scan in streaming mode all of these values are null, so a batch scan is performed here as well:

static ScanMode checkScanMode(ScanContext context) {
  if (context.startSnapshotId() != null
      || context.endSnapshotId() != null
      || context.startTag() != null
      || context.endTag() != null) {
    return ScanMode.INCREMENTAL_APPEND_SCAN;
  } else {
    return ScanMode.BATCH;
  }
}

  The batch scan goes through Iceberg's ordinary TableScan:

TableScan scan = table.newScan();
scan = refineScanWithBaseConfigs(scan, context, workerPool);

if (context.snapshotId() != null) {
  scan = scan.useSnapshot(context.snapshotId());
} else if (context.tag() != null) {
  scan = scan.useRef(context.tag());
} else if (context.branch() != null) {
  scan = scan.useRef(context.branch());
}

if (context.asOfTimestamp() != null) {
  scan = scan.asOfTime(context.asOfTimestamp());
}

return scan.planTasks();

3.1.2 discoverIncrementalSplits

  This is the method that reads incremental data. During incremental scanning, if the current snapshot is null or equals the snapshot from the previous scan, there is no new data and an empty split list is returned; otherwise the incremental splits are planned.
  First the range between the last scanned snapshot and the current snapshot is determined, then the scan is invoked:

ScanContext incrementalScan =
    scanContext.copyWithAppendsBetween(
        lastPosition.snapshotId(), toSnapshotInclusive.snapshotId());
List<IcebergSourceSplit> splits =
    FlinkSplitPlanner.planIcebergSourceSplits(table, incrementalScan, workerPool);

  The scan here still goes through the same method as in the previous section, except that it now takes the INCREMENTAL_APPEND_SCAN path instead of the batch scan, using a dedicated incremental scanner:

IncrementalAppendScan scan = table.newIncrementalAppendScan();

  Its implementation class is BaseIncrementalAppendScan. The scan first computes all of the incremental snapshots:

// appendsBetween handles null fromSnapshotId (exclusive) properly
List<Snapshot> snapshots =
    appendsBetween(table(), fromSnapshotIdExclusive, toSnapshotIdInclusive);


private static List<Snapshot> appendsBetween(
    Table table, Long fromSnapshotIdExclusive, long toSnapshotIdInclusive) {
  List<Snapshot> snapshots = Lists.newArrayList();
  for (Snapshot snapshot :
      SnapshotUtil.ancestorsBetween(
          toSnapshotIdInclusive, fromSnapshotIdExclusive, table::snapshot)) {
    if (snapshot.operation().equals(DataOperations.APPEND)) {
      snapshots.add(snapshot);
    }
  }

  return snapshots;
}

  Based on those snapshots, all dataManifests are then collected (manifests are divided into dataManifests and deleteManifests):

Set<ManifestFile> manifests =
    FluentIterable.from(snapshots)
        .transformAndConcat(snapshot -> snapshot.dataManifests(table().io()))
        .filter(manifestFile -> snapshotIds.contains(manifestFile.snapshotId()))
        .toSet();

  Finally, files are obtained through the normal Iceberg file scan. The key to reading only appended data and skipping deletes is the ManifestEntry.Status.ADDED filter; data removed by overwrite or delete operations lives in manifest files of the delete type:

ManifestGroup manifestGroup =
    new ManifestGroup(table().io(), manifests)
        .caseSensitive(isCaseSensitive())
        .select(scanColumns())
        .filterData(filter())
        .filterManifestEntries(
            manifestEntry ->
                snapshotIds.contains(manifestEntry.snapshotId())
                    && manifestEntry.status() == ManifestEntry.Status.ADDED)
        .specsById(table().specs())
        .ignoreDeleted();

if (context().ignoreResiduals()) {
  manifestGroup = manifestGroup.ignoreResiduals();
}

if (manifests.size() > 1 && shouldPlanWithExecutor()) {
  manifestGroup = manifestGroup.planWith(planExecutor());
}

return manifestGroup.planFiles();

3.2 processDiscoveredSplits

  This step puts all splits discovered in the previous step into the assigner. Note that, because discovery runs concurrently with the enumerator, the result's starting position must match the enumerator's current position before it is accepted:

assigner.onDiscoveredSplits(result.splits());
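
  A self-contained sketch of that consistency check (illustrative names, not the Iceberg source): a discovery result is accepted only if it started from the position the enumerator currently holds; otherwise it is dropped as stale.

import java.util.Collection;
import java.util.Objects;

// Illustrative sketch of the position-consistency check described above.
public class PositionConsistencySketch<PosT, SplitT> {

  /** Result of one discovery cycle: where it started, where it ended, and the splits found. */
  public static class DiscoveryResult<P, S> {
    final P fromPosition;
    final P toPosition;
    final Collection<S> splits;

    DiscoveryResult(P fromPosition, P toPosition, Collection<S> splits) {
      this.fromPosition = fromPosition;
      this.toPosition = toPosition;
      this.splits = splits;
    }
  }

  private PosT enumeratorPosition;                // last position the enumerator committed
  private final Collection<SplitT> pendingSplits; // stand-in for the split assigner

  public PositionConsistencySketch(Collection<SplitT> pendingSplits) {
    this.pendingSplits = pendingSplits;
  }

  public synchronized void processDiscoveredSplits(DiscoveryResult<PosT, SplitT> result) {
    if (!Objects.equals(enumeratorPosition, result.fromPosition)) {
      return; // stale result from a discovery cycle that started at a different position
    }
    pendingSplits.addAll(result.splits);
    enumeratorPosition = result.toPosition;
  }
}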

4 Split Assignment: AbstractIcebergEnumerator

  After the previous steps, all splits sit in the assigner but have not yet been assigned. Assignment is defined in AbstractIcebergEnumerator, the top-level parent class of Iceberg's enumerators, which directly implements Flink's SplitEnumerator.
  SplitEnumerator defines the handleSourceEvent method, which handles custom messages from the readers and is invoked by the SourceCoordinator:

} else if (event instanceof SourceEventWrapper) {
    final SourceEvent sourceEvent =
            ((SourceEventWrapper) event).getSourceEvent();
    LOG.debug(
            "Source {} received custom event from parallel task {}: {}",
            operatorName,
            subtask,
            sourceEvent);
    enumerator.handleSourceEvent(subtask, sourceEvent);

  Iceberg's implementation performs the split assignment here, specifically in the assignSplits method. Looking at the implementation, it does not assign everything at once; each waiting reader gets one split per round:

Iterator<Map.Entry<Integer, String>> awaitingReader =
    readersAwaitingSplit.entrySet().iterator();
while (awaitingReader.hasNext()) {
  Map.Entry<Integer, String> nextAwaiting = awaitingReader.next();
  // if the reader that requested another split has failed in the meantime, remove
  // it from the list of waiting readers
  if (!enumeratorContext.registeredReaders().containsKey(nextAwaiting.getKey())) {
    awaitingReader.remove();
    continue;
  }

  int awaitingSubtask = nextAwaiting.getKey();
  String hostname = nextAwaiting.getValue();
  GetSplitResult getResult = assigner.getNext(hostname);
  if (getResult.status() == GetSplitResult.Status.AVAILABLE) {
    LOG.info("Assign split to subtask {}: {}", awaitingSubtask, getResult.split());
    enumeratorContext.assignSplit(getResult.split(), awaitingSubtask);
    awaitingReader.remove();

5 IcebergSourceReader

  The Source's createReader method is invoked when the stream operator is created (createStreamOperator), so the reader is ultimately task-level and runs in parallel on the TaskManagers.

5.1 Requesting Splits

  Many of IcebergSourceReader's methods eventually call requestSplit, which sends a split-request event to obtain new splits; this corresponds to the request handled by the SplitEnumerator in the previous chapter:

private void requestSplit(Collection<String> finishedSplitIds) {
  context.sendSourceEventToCoordinator(new SplitRequestEvent(finishedSplitIds));
}

5.2 Processing Splits

  When IcebergSourceReader is created, it calls its parent constructor and passes in several processing classes; the most important is IcebergSourceSplitReader:

public IcebergSourceReader(
    IcebergSourceReaderMetrics metrics,
    ReaderFunction<T> readerFunction,
    SourceReaderContext context) {
  super(
      () -> new IcebergSourceSplitReader<>(metrics, readerFunction, context),
      new IcebergSourceRecordEmitter<>(),
      context.getConfiguration(),
      context);
}

  In Flink, IcebergSourceSplitReader is wrapped inside the SplitFetcherManager. When Flink's SourceOperator handles an AddSplitEvent, it calls sourceReader.addSplits(newSplits), which in turn calls SplitFetcherManager's addSplits method. That performs two actions: 1. createSplitFetcher; 2. fetcher.addSplits(splitsToAdd).
  createSplitFetcher creates a SplitFetcher, which in turn creates a FetchTask whose parameters include the SplitReader:

this.fetchTask =
        new FetchTask<>(
                splitReader,
                elementsQueue,
                ids -> {
                    ids.forEach(assignedSplits::remove);
                    splitFinishedHook.accept(ids);
                    LOG.info("Finished reading from splits {}", ids);
                },
                id);

  FetchTask's run method calls the SplitReader's fetch method to read records from splits; Iceberg's implementation is as follows:

public RecordsWithSplitIds<RecordAndPosition<T>> fetch() throws IOException {
  metrics.incrementSplitReaderFetchCalls(1);
  if (currentReader == null) {
    IcebergSourceSplit nextSplit = splits.poll();
    if (nextSplit != null) {
      currentSplit = nextSplit;
      currentSplitId = nextSplit.splitId();
      currentReader = openSplitFunction.apply(currentSplit);
    } else {
      // return an empty result, which will lead to split fetch to be idle.
      // SplitFetcherManager will then close idle fetcher.
      return new RecordsBySplits(Collections.emptyMap(), Collections.emptySet());
    }
  }

  if (currentReader.hasNext()) {
    // Because Iterator#next() doesn't support checked exception,
    // we need to wrap and unwrap the checked IOException with UncheckedIOException
    try {
      return currentReader.next();
    } catch (UncheckedIOException e) {
      throw e.getCause();
    }
  } else {
    return finishSplit();
  }
}
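
  The wrap/unwrap pattern mentioned in the comment is generic Java, sketched below: the record iterator wraps the checked IOException so it can satisfy Iterator#next(), and the caller unwraps it back into a checked exception.

import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Iterator;

// Generic sketch of the wrap/unwrap pattern used in the fetch() method above.
public class WrapUnwrapSketch {
  static Iterator<String> recordIterator() {
    return new Iterator<String>() {
      private int remaining = 3;

      @Override
      public boolean hasNext() {
        return remaining > 0;
      }

      @Override
      public String next() {
        try {
          return readRecord(); // may throw a checked IOException
        } catch (IOException e) {
          throw new UncheckedIOException(e); // Iterator#next() cannot declare checked exceptions
        }
      }

      private String readRecord() throws IOException {
        remaining--;
        return "record-" + remaining;
      }
    };
  }

  // The caller unwraps the UncheckedIOException back into the original IOException.
  static String fetchOne(Iterator<String> it) throws IOException {
    try {
      return it.next();
    } catch (UncheckedIOException e) {
      throw e.getCause();
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println(fetchOne(recordIterator()));
  }
}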

  openSplitFunction is the file-reading function, defined in IcebergSource:

if (readerFunction == null) {
  if (table instanceof BaseMetadataTable) {
    MetaDataReaderFunction rowDataReaderFunction =
        new MetaDataReaderFunction(
            flinkConfig, table.schema(), context.project(), table.io(), table.encryption());
    this.readerFunction = (ReaderFunction<T>) rowDataReaderFunction;
  } else {
    RowDataReaderFunction rowDataReaderFunction =
        new RowDataReaderFunction(
            flinkConfig,
            table.schema(),
            context.project(),
            context.nameMapping(),
            context.caseSensitive(),
            table.io(),
            table.encryption(),
            context.filters());
    this.readerFunction = (ReaderFunction<T>) rowDataReaderFunction;
  }
}

  The data consumed in fetch comes from splits.poll(); the splits queue is populated by handleSplitsChanges, which is invoked by the fetcher.addSplits(splitsToAdd) call that follows createSplitFetcher described above:

public void handleSplitsChanges(SplitsChange<IcebergSourceSplit> splitsChange) {
  if (!(splitsChange instanceof SplitsAddition)) {
    throw new UnsupportedOperationException(
        String.format("Unsupported split change: %s", splitsChange.getClass()));
  }

  LOG.info("Add {} splits to reader", splitsChange.splits().size());
  splits.addAll(splitsChange.splits());
  metrics.incrementAssignedSplits(splitsChange.splits().size());
  metrics.incrementAssignedBytes(calculateBytes(splitsChange));
}