Notes on the flink-mysql-connector full-snapshot table Split Chunk process

This article walks through the full-snapshot chunk-splitting process of the Flink MySQL CDC connector. The connector resolves the table's primary-key column, then, based on the configured chunk size and distribution factors, decides between an even or uneven chunk-splitting strategy, and finally produces MySqlSnapshotSplit objects for data ingestion. The splitting takes data-distribution uniformity into account to optimize read efficiency.

The source code below is from flink-connector-mysql-cdc version 2.3.0.

The method that performs the split is as follows:

	package com.ververica.cdc.connectors.mysql.source.assigners;
	
	private void splitChunksForRemainingTables() {
        try {
            // The Flink job recovers already-snapshotted tables from the checkpoint;
            // remainingTables below holds the tables that have not been snapshotted yet
            for (TableId nextTable : remainingTables) {
                // split the given table into chunks (snapshot splits)
                Collection<MySqlSnapshotSplit> splits = chunkSplitter.generateSplits(nextTable);
                synchronized (lock) {
                    remainingSplits.addAll(splits);
                    remainingTables.remove(nextTable);
                    lock.notify();
                }
            }
        } catch (Exception e) {
            if (uncaughtSplitterException == null) {
                uncaughtSplitterException = e;
            } else {
                uncaughtSplitterException.addSuppressed(e);
            }
            // Release the potential waiting getNext() call
            synchronized (lock) {
                lock.notify();
            }
        }
    }
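The splitter thread above hands newly generated splits to the reader through a shared lock and `lock.notify()`, while a `getNext()`-style consumer waits on the same lock. A minimal sketch of that handoff pattern, stripped of the connector's details (the `SplitQueue` class and `String` splits are made up for illustration):

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch of the lock/notify handoff used by the split assigner.
public class SplitQueue {
    private final Object lock = new Object();
    private final Queue<String> remainingSplits = new ArrayDeque<>();

    // Producer side: mirrors splitChunksForRemainingTables publishing splits.
    public void addSplit(String split) {
        synchronized (lock) {
            remainingSplits.add(split);
            lock.notify(); // wake a waiting getNext() call
        }
    }

    // Consumer side: blocks until the splitter publishes a split.
    public String getNext() {
        synchronized (lock) {
            while (remainingSplits.isEmpty()) {
                try {
                    lock.wait();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting", e);
                }
            }
            return remainingSplits.poll();
        }
    }
}
```

Note that the catch branch in the real method also calls `lock.notify()` so that a blocked consumer is released even when splitting fails.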

The core work happens in the generateSplits method:

	/** Generates all snapshot splits (chunks) for the given table path. */
    public Collection<MySqlSnapshotSplit> generateSplits(TableId tableId) {
        try (JdbcConnection jdbc = openJdbcConnection(sourceConfig)) {

            LOG.info("Start splitting table {} into chunks...", tableId);
            long start = System.currentTimeMillis();

            Table table = mySqlSchema.getTableSchema(jdbc, tableId).getTable();
            // get the column the table will be chunked by
            Column chunkKeyColumn =
                    ChunkUtils.getChunkKeyColumn(table, sourceConfig.getChunkKeyColumn());
            final List<ChunkRange> chunks;
            try {
                // split based on chunkKeyColumn
                chunks = splitTableIntoChunks(jdbc, tableId, chunkKeyColumn);
            } catch (SQLException e) {
                throw new FlinkRuntimeException("Failed to split chunks for table " + tableId, e);
            }

            // convert chunks into splits
            List<MySqlSnapshotSplit> splits = new ArrayList<>();
            RowType chunkKeyColumnType = ChunkUtils.getChunkKeyColumnType(chunkKeyColumn);
            for (int i = 0; i < chunks.size(); i++) {
                ChunkRange chunk = chunks.get(i);
                // wrap each chunk as a MySqlSnapshotSplit
                MySqlSnapshotSplit split =
                        createSnapshotSplit(
                                jdbc,
                                tableId,
                                i,
                                chunkKeyColumnType,
                                chunk.getChunkStart(),
                                chunk.getChunkEnd());
                splits.add(split);
            }

            long end = System.currentTimeMillis();
            LOG.info(
                    "Split table {} into {} chunks, time cost: {}ms.",
                    tableId,
                    splits.size(),
                    end - start);
            return splits;
        } catch (Exception e) {
            throw new FlinkRuntimeException(
                    String.format("Generate Splits for table %s error", tableId), e);
        }
    }

A closer look at getChunkKeyColumn, which resolves the chunk key column:

	public static Column getChunkKeyColumn(Table table, @Nullable String chunkKeyColumn) {
        // get the table's primary key; fail immediately if there is none
        List<Column> primaryKeys = table.primaryKeyColumns();
        if (primaryKeys.isEmpty()) {
            throw new ValidationException(
                    String.format(
                            "Incremental snapshot for tables requires primary key,"
                                    + " but table %s doesn't have primary key.",
                            table.id()));
        }
		
        // if chunkKeyColumn is configured, check whether it is a primary-key column;
        // return it if so, otherwise fail
        if (chunkKeyColumn != null) {
            Optional<Column> targetPkColumn =
                    primaryKeys.stream()
                            .filter(col -> chunkKeyColumn.equals(col.name()))
                            .findFirst();
            if (targetPkColumn.isPresent()) {
                return targetPkColumn.get();
            }
            throw new ValidationException(
                    String.format(
                            "Chunk key column '%s' doesn't exist in the primary key [%s] of the table %s.",
                            chunkKeyColumn,
                            primaryKeys.stream().map(Column::name).collect(Collectors.joining(",")),
                            table.id()));
        }
		
        // for tables with a composite primary key, default to the first primary-key column as the chunk key
        return primaryKeys.get(0);
    }
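The selection rule above boils down to three cases: no primary key is an error; a configured chunk key column is accepted only if it is one of the primary-key columns; otherwise the first primary-key column wins. A standalone distillation over plain column names (the `ChunkKeyChooser` class is hypothetical, not connector code):

```java
import java.util.List;

// Hypothetical distillation of ChunkUtils.getChunkKeyColumn's selection rule.
public class ChunkKeyChooser {
    public static String choose(List<String> primaryKeys, String configured) {
        if (primaryKeys.isEmpty()) {
            // incremental snapshot requires a primary key
            throw new IllegalArgumentException("table has no primary key");
        }
        if (configured != null) {
            if (primaryKeys.contains(configured)) {
                return configured;
            }
            throw new IllegalArgumentException(
                    "chunk key column '" + configured + "' is not a primary-key column");
        }
        // composite primary key: default to the first primary-key column
        return primaryKeys.get(0);
    }
}
```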

With the chunkKeyColumn resolved, splitTableIntoChunks performs the actual split:

	private List<ChunkRange> splitTableIntoChunks(
            JdbcConnection jdbc, TableId tableId, Column splitColumn) throws SQLException {
        // splitColumnName is the chunk key column's name
        final String splitColumnName = splitColumn.name();
        // query the split column's min and max via
        // "SELECT MIN(splitColumnName), MAX(splitColumnName) FROM tableId"
        final Object[] minMaxOfSplitColumn = queryMinMax(jdbc, tableId, splitColumnName);
        final Object min = minMaxOfSplitColumn[0];
        final Object max = minMaxOfSplitColumn[1];
        if (min == null || max == null || min.equals(max)) {
            // empty table, or all rows share one key value: return the whole table as a single chunk
            // empty table, or only one row, return full table scan as a chunk
            return Collections.singletonList(ChunkRange.all());
        }
        // scan.incremental.snapshot.chunk.size, the split chunk size, default 8096
        final int chunkSize = sourceConfig.getSplitSize();
        // chunk-key.even-distribution.factor.upper-bound, default 1000.0d
        final double distributionFactorUpper = sourceConfig.getDistributionFactorUpper();
        // chunk-key.even-distribution.factor.lower-bound, default 0.05d
        final double distributionFactorLower = sourceConfig.getDistributionFactorLower();
		
        // check whether splitColumn can be split evenly: BIGINT, INTEGER and DECIMAL
        // column types are considered evenly splittable; all other types are not
        if (isEvenlySplitColumn(splitColumn)) {
            // quickly estimate the table's row count via SHOW TABLE STATUS LIKE 'tableId'
            long approximateRowCnt = queryApproximateRowCnt(jdbc, tableId);
            // compute the distribution factor: Double.MAX_VALUE if approximateRowCnt == 0,
            // otherwise distributionFactor = (max - min + 1) / approximateRowCnt
            double distributionFactor =
                    calculateDistributionFactor(tableId, min, max, approximateRowCnt);
            // the table counts as evenly distributed iff distributionFactor lies
            // between distributionFactorLower and distributionFactorUpper
            boolean dataIsEvenlyDistributed =
                    doubleCompare(distributionFactor, distributionFactorLower) >= 0
                            && doubleCompare(distributionFactor, distributionFactorUpper) <= 0;

            if (dataIsEvenlyDistributed) {
                // the minimum dynamic chunk size is at least 1
                // for evenly distributed data, derive dynamicChunkSize from the distribution factor
                final int dynamicChunkSize = Math.max((int) (distributionFactor * chunkSize), 1);
                // split the evenly distributed table
                return splitEvenlySizedChunks(
                        tableId, min, max, approximateRowCnt, chunkSize, dynamicChunkSize);
            } else {
                // split the unevenly distributed table
                return splitUnevenlySizedChunks(
                        jdbc, tableId, splitColumnName, min, max, chunkSize);
            }
        } else {
            // split the table whose split column type cannot be split evenly
            return splitUnevenlySizedChunks(jdbc, tableId, splitColumnName, min, max, chunkSize);
        }
    }
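The even/uneven decision reduces to a little arithmetic over min, max and the approximate row count. A sketch of that arithmetic for long-valued keys, assuming the 2.3.0 defaults (chunk size 8096, factor bounds 0.05 and 1000.0); the `ChunkMath` helper is made up for illustration, and the real code operates on generic key objects via BigDecimal:

```java
// Hypothetical helper reproducing the distribution-factor arithmetic for long keys.
public class ChunkMath {
    // distributionFactor = (max - min + 1) / approximateRowCnt
    public static double distributionFactor(long min, long max, long approximateRowCnt) {
        if (approximateRowCnt == 0) {
            return Double.MAX_VALUE;
        }
        return (max - min + 1) / (double) approximateRowCnt;
    }

    // evenly distributed iff the factor falls inside [lower, upper]
    public static boolean isEvenlyDistributed(double factor, double lower, double upper) {
        return factor >= lower && factor <= upper;
    }

    // the minimum dynamic chunk size is at least 1
    public static int dynamicChunkSize(double factor, int chunkSize) {
        return Math.max((int) (factor * chunkSize), 1);
    }
}
```

For a dense auto-increment key (max - min + 1 close to the row count) the factor is close to 1.0, so the dynamic chunk size stays near the configured 8096; sparse keys push the factor up and enlarge each range accordingly.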

splitEvenlySizedChunks, which splits an evenly distributed table:

	public List<ChunkRange> splitEvenlySizedChunks(
            TableId tableId,
            Object min,
            Object max,
            long approximateRowCnt,
            int chunkSize,
            int dynamicChunkSize) {
        LOG.info(
                "Use evenly-sized chunk optimization for table {}, the approximate row count is {}, the chunk size is {}, the dynamic chunk size is {}",
                tableId,
                approximateRowCnt,
                chunkSize,
                dynamicChunkSize);
        // if the configured (or default) chunkSize >= approximateRowCnt,
        // return a single chunk without splitting
        if (approximateRowCnt <= chunkSize) {
            // there is no more than one chunk, return full table as a chunk
            return Collections.singletonList(ChunkRange.all());
        }

        final List<ChunkRange> splits = new ArrayList<>();
        Object chunkStart = null;
        // the first chunk ends at min + dynamicChunkSize
        Object chunkEnd = ObjectUtils.plus(min, dynamicChunkSize);
        // step from min to max in increments of dynamicChunkSize
        while (ObjectUtils.compare(chunkEnd, max) <= 0) {
            splits.add(ChunkRange.of(chunkStart, chunkEnd));
            chunkStart = chunkEnd;
            try {
                chunkEnd = ObjectUtils.plus(chunkEnd, dynamicChunkSize);
            } catch (ArithmeticException e) {
                // Stop chunk split to avoid dead loop when number overflows.
                break;
            }
        }
        // add the ending split, covering the remainder of (max - min) / dynamicChunkSize
        // left over by the loop above
        splits.add(ChunkRange.of(chunkStart, null));
        return splits;
    }
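Restricted to long-valued chunk keys, the loop above yields half-open ranges whose first start and last end are null (open-ended), so the full key space is covered without gaps. A minimal simulation (the `EvenSplitter` class is hypothetical; the real code works on generic Objects via ObjectUtils and also guards against numeric overflow):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of splitEvenlySizedChunks for long keys.
public class EvenSplitter {
    // Each range is [start, end); null means open-ended on that side.
    public static List<Long[]> split(long min, long max, int dynamicChunkSize) {
        List<Long[]> splits = new ArrayList<>();
        Long chunkStart = null;
        long chunkEnd = min + dynamicChunkSize; // first chunk ends at min + dynamicChunkSize
        while (chunkEnd <= max) {
            splits.add(new Long[] {chunkStart, chunkEnd});
            chunkStart = chunkEnd;
            chunkEnd += dynamicChunkSize;
        }
        // ending split covers the remainder past the last full step
        splits.add(new Long[] {chunkStart, null});
        return splits;
    }
}
```

For example, min = 1, max = 100 and dynamicChunkSize = 40 gives the three ranges [null, 41), [41, 81), [81, null).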

splitUnevenlySizedChunks, which splits an unevenly distributed table:

	private List<ChunkRange> splitUnevenlySizedChunks(
            JdbcConnection jdbc,
            TableId tableId,
            String splitColumnName,
            Object min,
            Object max,
            int chunkSize)
            throws SQLException {
        LOG.info(
                "Use unevenly-sized chunks for table {}, the chunk size is {}", tableId, chunkSize);
        final List<ChunkRange> splits = new ArrayList<>();
        Object chunkStart = null;
        // each chunk end is found via
        // "SELECT MAX(%s) FROM (SELECT %s FROM %s WHERE %s >= ? ORDER BY %s ASC LIMIT %s) AS T";
        // the returned value is further validated, see the nextChunkEnd method in the source
        Object chunkEnd = nextChunkEnd(jdbc, min, tableId, splitColumnName, max, chunkSize);
        int count = 0;
        // append chunks in a loop
        while (chunkEnd != null && ObjectUtils.compare(chunkEnd, max) <= 0) {
            // we start from [null, min + chunk_size) and avoid [null, min)
            splits.add(ChunkRange.of(chunkStart, chunkEnd));
            // may sleep a while to avoid DDOS on MySQL server
            maySleep(count++, tableId);
            chunkStart = chunkEnd;
            chunkEnd = nextChunkEnd(jdbc, chunkEnd, tableId, splitColumnName, max, chunkSize);
        }
        // add the ending split, covering the rows not visited by the loop
        splits.add(ChunkRange.of(chunkStart, null));
        return splits;
    }
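The same walk can be simulated over an in-memory sorted key list: each boundary is the largest of the next chunkSize keys at or above the previous boundary, exactly what the LIMIT query returns. A sketch assuming distinct long keys (the `UnevenSplitter` class is hypothetical; the real nextChunkEnd additionally queries the next larger key when the boundary would repeat):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of splitUnevenlySizedChunks over sorted, distinct long keys.
public class UnevenSplitter {
    // Mimics "SELECT MAX(col) FROM (SELECT col FROM t WHERE col >= ? ORDER BY col ASC LIMIT chunkSize) AS T".
    static Long nextChunkEnd(long[] sortedKeys, long from, long max, int chunkSize) {
        int count = 0;
        long last = from;
        for (long key : sortedKeys) {
            if (key >= from) {
                last = key;
                if (++count == chunkSize) {
                    break;
                }
            }
        }
        // reaching max (or finding nothing) means there is no further boundary
        return (count == 0 || last >= max) ? null : last;
    }

    public static List<Long[]> split(long[] sortedKeys, long min, long max, int chunkSize) {
        List<Long[]> splits = new ArrayList<>();
        Long chunkStart = null;
        Long chunkEnd = nextChunkEnd(sortedKeys, min, max, chunkSize);
        while (chunkEnd != null && chunkEnd <= max) {
            splits.add(new Long[] {chunkStart, chunkEnd});
            chunkStart = chunkEnd;
            chunkEnd = nextChunkEnd(sortedKeys, chunkEnd, max, chunkSize);
        }
        // ending split covers the remaining rows
        splits.add(new Long[] {chunkStart, null});
        return splits;
    }
}
```

Unlike the even path, each boundary here costs one query against MySQL, which is why the real code calls maySleep between chunks to avoid hammering the server.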

To summarize the MySQL snapshot Split Chunk process:

  1. generateSplits resolves the chunkKeyColumn that the split will be based on
  2. splitTableIntoChunks queries the min and max of the chunkKeyColumn
    2.1 if min equals max (or the table is empty), return a single chunk
    2.2 read chunkSize, distributionFactorUpper and distributionFactorLower, which control the split size and the even-distribution check
    2.3 decide whether the split column (i.e. the chunkKeyColumn) is evenly splittable, based on whether its converted type is BIGINT, INTEGER or DECIMAL
    2.3.1 if evenly splittable, query the approximate row count and compute distributionFactor, by default (max - min + 1) / rowCount; the table counts as evenly distributed when distributionFactor lies between distributionFactorLower and distributionFactorUpper
    2.3.1.1 evenly distributed: split chunks by distribution factor times chunk size, i.e. distributionFactor * chunkSize
    2.3.1.2 not evenly distributed: split chunks by chunkSize
    2.3.2 not evenly splittable: split chunks by chunkSize, the same handling as 2.3.1.2
  3. after splitting, iterate over the chunks and call createSnapshotSplit to build one MySqlSnapshotSplit per chunk; this object describes one snapshot split of the MySQL table and carries the jdbc info, table name, chunk index, chunkKeyColumnType, chunkStart and chunkEnd