Making Flink's OrcTableSource read multiple data directories in stream mode

Dependencies

  1. Flink version: 1.9.2

The four classes to create or update

  1. Create a new MultiFileInputFormat class

    When reading files in stream mode, Flink has two modes. The first goes through ContinuousFileMonitoringFunction: under the hood the data is read by a single thread, and files are sorted by modification time and consumed from oldest to newest, which preserves the time order of the data. The second reads files with multiple threads in parallel and gives no ordering guarantee. We need the first mode, so this class must extend FileInputFormat while also supporting multiple data directories (a usage sketch follows this list).

  2. Create a new OrcMultiRowInputFormat class

    Copy the original OrcRowInputFormat class and modify parts of it so that it supports multiple data directories.

  3. Update the OrcTableSource class from the Flink source

    Update parts of the code so that it uses the new OrcMultiRowInputFormat and supports configuring multiple data directories.

  4. Update the ContinuousFileMonitoringFunction class from the Flink source

    Update parts of the code to support configuring multiple data directories.
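
Once the four classes are in place, the pieces can be wired together roughly as shown below. This is a minimal sketch, not code from the patch: the schema string and HDFS directories are placeholders, the import of OrcMultiRowInputFormat assumes it lives in the com.hansight.orc package suggested by the imports further down, and the streaming part assumes the Flink 1.9 signatures of ContinuousFileMonitoringFunction and ContinuousFileReaderOperator together with the patched monitoring function from step 4.

import com.hansight.orc.OrcMultiRowInputFormat; // the class defined below (assumed package)
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.ContinuousFileMonitoringFunction;
import org.apache.flink.streaming.api.functions.source.ContinuousFileReaderOperator;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;
import org.apache.flink.types.Row;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf;

public class MultiDirOrcStreamJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Build the format with several data directories (schema and paths are placeholders).
        OrcMultiRowInputFormat format = new OrcMultiRowInputFormat(
                "struct<id:bigint,name:string,ts:timestamp>",
                new Configuration(),
                new Path("hdfs:///data/orc/dir1"),
                new Path("hdfs:///data/orc/dir2"));
        format.setNestedFileEnumeration(true);
        // Optional: project columns and push a filter down to the ORC reader.
        format.selectFields(0, 1);
        format.addPredicate(
                new OrcMultiRowInputFormat.Equals("name", PredicateLeaf.Type.STRING, "foo"));

        // Stream mode: the (patched) ContinuousFileMonitoringFunction scans every configured
        // directory, orders the discovered files by modification time and emits their splits;
        // the ContinuousFileReaderOperator then reads the splits and produces rows.
        ContinuousFileMonitoringFunction<Row> monitor = new ContinuousFileMonitoringFunction<>(
                format, FileProcessingMode.PROCESS_CONTINUOUSLY, env.getParallelism(), 60_000L);

        DataStream<Row> rows = env
                .addSource(monitor, "orc-multi-dir-monitor")
                .setParallelism(1) // the monitoring function must run with parallelism 1
                .transform("orc-split-reader", format.getProducedType(),
                        new ContinuousFileReaderOperator<>(format));

        rows.print();
        env.execute("orc-multi-dir-stream");
    }
}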

Code

  1. MultiFileInputFormat
import org.apache.flink.api.common.io.*;
import org.apache.flink.api.common.io.compression.*;
import org.apache.flink.api.common.io.statistics.BaseStatistics;
import org.apache.flink.configuration.ConfigConstants;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.GlobalConfiguration;
import org.apache.flink.core.fs.*;
import org.apache.flink.util.Preconditions;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.IOException;
import java.util.*;

public abstract class MultiFileInputFormat<OT> extends FileInputFormat<OT> {


    // -------------------------------------- Constants -------------------------------------------

    private static final Logger LOG = LoggerFactory.getLogger(MultiFileInputFormat.class);

    private static final long serialVersionUID = 1L;


    /**
     * The fraction that the last split may be larger than the others.
     */
    private static final float MAX_SPLIT_SIZE_DISCREPANCY = 1.1f;

    /**
     * The timeout (in milliseconds) to wait for a filesystem stream to respond.
     */
    private static long DEFAULT_OPENING_TIMEOUT;

    /**
     * A mapping of file extensions to decompression algorithms based on DEFLATE. Such compressions lead to
     * unsplittable files.
     */
    protected static final Map<String, InflaterInputStreamFactory<?>> INFLATER_INPUT_STREAM_FACTORIES =
            new HashMap<String, InflaterInputStreamFactory<?>>();

    /**
     * The splitLength is set to -1L for reading the whole split.
     */
    protected static final long READ_WHOLE_SPLIT_FLAG = -1L;

    static {
        initDefaultsFromConfiguration(GlobalConfiguration.loadConfiguration());
        initDefaultInflaterInputStreamFactories();
    }

    /**
     * Initialize defaults for input format. Needs to be a static method because it is configured for local
     * cluster execution.
     * @param configuration The configuration to load defaults from
     */
    private static void initDefaultsFromConfiguration(Configuration configuration) {
        final long to = configuration.getLong(ConfigConstants.FS_STREAM_OPENING_TIMEOUT_KEY,
                ConfigConstants.DEFAULT_FS_STREAM_OPENING_TIMEOUT);
        if (to < 0) {
            LOG.error("Invalid timeout value for filesystem stream opening: " + to + ". Using default value of " +
                    ConfigConstants.DEFAULT_FS_STREAM_OPENING_TIMEOUT);
            DEFAULT_OPENING_TIMEOUT = ConfigConstants.DEFAULT_FS_STREAM_OPENING_TIMEOUT;
        } else if (to == 0) {
            DEFAULT_OPENING_TIMEOUT = 300000; // 5 minutes
        } else {
            DEFAULT_OPENING_TIMEOUT = to;
        }
    }

    private static void initDefaultInflaterInputStreamFactories() {
        InflaterInputStreamFactory<?>[] defaultFactories = {
                DeflateInflaterInputStreamFactory.getInstance(),
                GzipInflaterInputStreamFactory.getInstance(),
                Bzip2InputStreamFactory.getInstance(),
                XZInputStreamFactory.getInstance(),
        };
        for (InflaterInputStreamFactory<?> inputStreamFactory : defaultFactories) {
            for (String fileExtension : inputStreamFactory.getCommonFileExtensions()) {
                registerInflaterInputStreamFactory(fileExtension, inputStreamFactory);
            }
        }
    }

    /**
     * Registers a decompression algorithm through a {@link org.apache.flink.api.common.io.compression.InflaterInputStreamFactory}
     * with a file extension for transparent decompression.
     * @param fileExtension of the compressed files
     * @param factory to create an {@link java.util.zip.InflaterInputStream} that handles the decompression format
     */
    public static void registerInflaterInputStreamFactory(String fileExtension, InflaterInputStreamFactory<?> factory) {
        synchronized (INFLATER_INPUT_STREAM_FACTORIES) {
            if (INFLATER_INPUT_STREAM_FACTORIES.put(fileExtension, factory) != null) {
                LOG.warn("Overwriting an existing decompression algorithm for \"{}\" files.", fileExtension);
            }
        }
    }

    protected static InflaterInputStreamFactory<?> getInflaterInputStreamFactory(String fileExtension) {
        synchronized (INFLATER_INPUT_STREAM_FACTORIES) {
            return INFLATER_INPUT_STREAM_FACTORIES.get(fileExtension);
        }
    }

    /**
     * Returns the extension of a file name (!= a path).
     * @return the extension of the file name or {@code null} if there is no extension.
     */
    protected static String extractFileExtension(String fileName) {
        int lastPeriodIndex = fileName.lastIndexOf('.');
        if (lastPeriodIndex < 0){
            return null;
        } else {
            return fileName.substring(lastPeriodIndex + 1);
        }
    }

    // --------------------------------------------------------------------------------------------
    //  Variables for internal operation.
    //  They are all transient, because we do not want them so be serialized
    // --------------------------------------------------------------------------------------------

    /**
     * The input stream reading from the input file.
     */
    protected transient FSDataInputStream stream;

    /**
     * The start of the split that this parallel instance must consume.
     */
    protected transient long splitStart;

    /**
     * The length of the split that this parallel instance must consume.
     */
    protected transient long splitLength;

    /**
     * The current split that this parallel instance must consume.
     */
    protected transient FileInputSplit currentSplit;

    // --------------------------------------------------------------------------------------------
    //  The configuration parameters. Configured on the instance and serialized to be shipped.
    // --------------------------------------------------------------------------------------------

    /**
     * The path to the file that contains the input.
     *
     * @deprecated Please override {@link FileInputFormat#supportsMultiPaths()} and
     *             use {@link FileInputFormat#getFilePaths()} and {@link FileInputFormat#setFilePaths(Path...)}.
     */
    @Deprecated
    protected Path filePath;

    /**
     * The list of paths to files and directories that contain the input.
     */
    private Path[] filePaths;

    /**
     * The minimal split size, set by the configure() method.
     */
    protected long minSplitSize = 0;

    /**
     * The desired number of splits, as set by the configure() method.
     */
    protected int numSplits = -1;

    /**
     * Stream opening timeout.
     */
    protected long openTimeout = DEFAULT_OPENING_TIMEOUT;

    /**
     * Some file input formats are not splittable on a block level (avro, deflate)
     * Therefore, the FileInputFormat can only read whole files.
     */
    protected boolean unsplittable = false;

    /**
     * The flag to specify whether recursive traversal of the input directory
     * structure is enabled.
     */
    protected boolean enumerateNestedFiles = false;

    /**
     * Files filter for determining what files/directories should be included.
     */
    private FilePathFilter filesFilter = new GlobFilePathFilter();

    // --------------------------------------------------------------------------------------------
    //  Constructors
    // --------------------------------------------------------------------------------------------

    public MultiFileInputFormat() {}

    protected MultiFileInputFormat(Path... filePath) {
        if (filePath != null) {
            setFilePaths(filePath);
        }
    }
    // --------------------------------------------------------------------------------------------
    //  Getters/setters for the configurable parameters
    // --------------------------------------------------------------------------------------------

    /**
     *
     * @return The path of the file to read.
     *
     * @deprecated Please use getFilePaths() instead.
     */
    @Deprecated
    public Path getFilePath() {

        if (supportsMultiPaths()) {
            if (this.filePaths == null || this.filePaths.length == 0) {
                return null;
            } else if (this.filePaths.length == 1) {
                return this.filePaths[0];
            } else {
                throw new UnsupportedOperationException(
                        "FileInputFormat is configured with multiple paths. Use getFilePaths() instead.");
            }
        } else {
            return filePath;
        }
    }

    /**
     * Returns the paths of all files to be read by the FileInputFormat.
     *
     * @return The list of all paths to read.
     */
    public Path[] getFilePaths() {
        // Guard against NPEs in configure()/toString() when no path has been set yet.
        return this.filePaths != null ? this.filePaths : new Path[0];
    }

    public void setFilePath(String filePath) {
        if (filePath == null) {
            throw new IllegalArgumentException("File path cannot be null.");
        }

        // TODO The job-submission web interface passes empty args (and thus empty
        // paths) to compute the preview graph. The following is a workaround for
        // this situation and we should fix this.

        // comment (Stephan Ewen) this should be no longer relevant with the current Java/Scala APIs.
        if (filePath.isEmpty()) {
            setFilePath(new Path());
            return;
        }

        try {
            this.setFilePath(new Path(filePath));
        } catch (RuntimeException rex) {
            throw new RuntimeException("Could not create a valid URI from the given file path name: " + rex.getMessage());
        }
    }

    /**
     * Sets a single path of a file to be read.
     *
     * @param filePath The path of the file to read.
     */
    public void setFilePath(Path filePath) {
        if (filePath == null) {
            throw new IllegalArgumentException("File path must not be null.");
        }

        setFilePaths(filePath);
    }

    /**
     * Sets multiple paths of files to be read.
     *
     * @param filePaths The paths of the files to read.
     */
    public void setFilePaths(String... filePaths) {
        Path[] paths = new Path[filePaths.length];
        for (int i = 0; i < paths.length; i++) {
            paths[i] = new Path(filePaths[i]);
        }
        setFilePaths(paths);
    }

    /**
     * Sets multiple paths of files to be read.
     *
     * @param filePaths The paths of the files to read.
     */
    public void setFilePaths(Path... filePaths) {
        if (filePaths.length < 1) {
            throw new IllegalArgumentException("At least one file path must be specified.");
        }
        if (filePaths.length == 1) {
            // set for backwards compatibility
            this.filePath = filePaths[0];
        } else {
            // clear file path in case it had been set before
            this.filePath = null;
        }

        this.filePaths = filePaths;
    }

    public long getMinSplitSize() {
        return minSplitSize;
    }

    public void setMinSplitSize(long minSplitSize) {
        if (minSplitSize < 0) {
            throw new IllegalArgumentException("The minimum split size cannot be negative.");
        }

        this.minSplitSize = minSplitSize;
    }

    public int getNumSplits() {
        return numSplits;
    }

    public void setNumSplits(int numSplits) {
        if (numSplits < -1 || numSplits == 0) {
            throw new IllegalArgumentException("The desired number of splits must be positive or -1 (= don't care).");
        }

        this.numSplits = numSplits;
    }

    public long getOpenTimeout() {
        return openTimeout;
    }

    public void setOpenTimeout(long openTimeout) {
        if (openTimeout < 0) {
            throw new IllegalArgumentException("The timeout for opening the input splits must be positive or zero (= infinite).");
        }
        this.openTimeout = openTimeout;
    }

    public void setNestedFileEnumeration(boolean enable) {
        this.enumerateNestedFiles = enable;
    }

    public boolean getNestedFileEnumeration() {
        return this.enumerateNestedFiles;
    }

    // --------------------------------------------------------------------------------------------
    // Getting information about the split that is currently open
    // --------------------------------------------------------------------------------------------

    /**
     * Gets the start of the current split.
     *
     * @return The start of the split.
     */
    public long getSplitStart() {
        return splitStart;
    }

    /**
     * Gets the length or remaining length of the current split.
     *
     * @return The length or remaining length of the current split.
     */
    public long getSplitLength() {
        return splitLength;
    }

    public void setFilesFilter(FilePathFilter filesFilter) {
        this.filesFilter = Preconditions.checkNotNull(filesFilter, "Files filter should not be null");
    }

    // --------------------------------------------------------------------------------------------
    //  Pre-flight: Configuration, Splits, Sampling
    // --------------------------------------------------------------------------------------------

    /**
     * Configures the file input format by reading the file path from the configuration.
     *
     * @see org.apache.flink.api.common.io.InputFormat#configure(org.apache.flink.configuration.Configuration)
     */
    @Override
    public void configure(Configuration parameters) {

        if (getFilePaths().length == 0) {
            // file path was not specified yet. Try to set it from the parameters.
            String filePath = parameters.getString(FILE_PARAMETER_KEY, null);
            if (filePath == null) {
                throw new IllegalArgumentException("File path was not specified in input format or configuration.");
            } else {
                setFilePath(filePath);
            }
        }

        if (!this.enumerateNestedFiles) {
            this.enumerateNestedFiles = parameters.getBoolean(ENUMERATE_NESTED_FILES_FLAG, false);
        }
    }

    protected FileBaseStatistics getFileStats(FileBaseStatistics cachedStats, Path[] filePaths, ArrayList<FileStatus> files) throws IOException {

        long totalLength = 0;
        long latestModTime = 0;

        for (Path path : filePaths) {
            final FileSystem fs = FileSystem.get(path.toUri());
            final FileBaseStatistics stats = getFileStats(cachedStats, path, fs, files);

            if (stats.getTotalInputSize() == BaseStatistics.SIZE_UNKNOWN) {
                totalLength = BaseStatistics.SIZE_UNKNOWN;
            } else if (totalLength != BaseStatistics.SIZE_UNKNOWN) {
                totalLength += stats.getTotalInputSize();
            }
            latestModTime = Math.max(latestModTime, stats.getLastModificationTime());
        }

        // check whether the cached statistics are still valid, if we have any
        if (cachedStats != null && latestModTime <= cachedStats.getLastModificationTime()) {
            return cachedStats;
        }

        return new FileBaseStatistics(latestModTime, totalLength, BaseStatistics.AVG_RECORD_BYTES_UNKNOWN);
    }

    protected FileBaseStatistics getFileStats(FileBaseStatistics cachedStats, Path filePath, FileSystem fs, ArrayList<FileStatus> files) throws IOException {

        // get the file info and check whether the cached statistics are still valid.
        final FileStatus file = fs.getFileStatus(filePath);
        long totalLength = 0;

        // enumerate all files
        if (file.isDir()) {
            totalLength += addFilesInDir(file.getPath(), files, false);
        } else {
            files.add(file);
            testForUnsplittable(file);
            totalLength += file.getLen();
        }

        // check the modification time stamp
        long latestModTime = 0;
        for (FileStatus f : files) {
            latestModTime = Math.max(f.getModificationTime(), latestModTime);
        }

        // check whether the cached statistics are still valid, if we have any
        if (cachedStats != null && latestModTime <= cachedStats.getLastModificationTime()) {
            return cachedStats;
        }

        // sanity check
        if (totalLength <= 0) {
            totalLength = BaseStatistics.SIZE_UNKNOWN;
        }
        return new FileBaseStatistics(latestModTime, totalLength, BaseStatistics.AVG_RECORD_BYTES_UNKNOWN);
    }

    @Override
    public LocatableInputSplitAssigner getInputSplitAssigner(FileInputSplit[] splits) {
        return new LocatableInputSplitAssigner(splits);
    }

    /**
     * Computes the input splits for the file. By default, one file block is one split. If more splits
     * are requested than blocks are available, then a split may be a fraction of a block and splits may cross
     * block boundaries.
     *
     * @param minNumSplits The minimum desired number of file splits.
     * @return The computed file splits.
     *
     * @see org.apache.flink.api.common.io.InputFormat#createInputSplits(int)
     */
    @Override
    public FileInputSplit[] createInputSplits(int minNumSplits) throws IOException {
        if (minNumSplits < 1) {
            throw new IllegalArgumentException("Number of input splits has to be at least 1.");
        }

        // take the desired number of splits into account
        minNumSplits = Math.max(minNumSplits, this.numSplits);

        final List<FileInputSplit> inputSplits = new ArrayList<FileInputSplit>(minNumSplits);

        // get all the files that are involved in the splits
        List<FileStatus> files = new ArrayList<>();
        long totalLength = 0;

        for (Path path : getFilePaths()) {
            final FileSystem fs = path.getFileSystem();
            final FileStatus pathFile = fs.getFileStatus(path);

            if (pathFile.isDir()) {
                totalLength += addFilesInDir(path, files, true);
            } else {
                testForUnsplittable(pathFile);

                files.add(pathFile);
                totalLength += pathFile.getLen();
            }
        }

        // returns if unsplittable
        if (unsplittable) {
            int splitNum = 0;
            for (final FileStatus file : files) {
                final FileSystem fs = file.getPath().getFileSystem();
                final BlockLocation[] blocks = fs.getFileBlockLocations(file, 0, file.getLen());
                Set<String> hosts = new HashSet<String>();
                for(BlockLocation block : blocks) {
                    hosts.addAll(Arrays.asList(block.getHosts()));
                }
                long len = file.getLen();
                if(testForUnsplittable(file)) {
                    len = READ_WHOLE_SPLIT_FLAG;
                }
                FileInputSplit fis = new FileInputSplit(splitNum++, file.getPath(), 0, len,
                        hosts.toArray(new String[hosts.size()]));
                inputSplits.add(fis);
            }
            return inputSplits.toArray(new FileInputSplit[inputSplits.size()]);
        }


        final long maxSplitSize = totalLength / minNumSplits + (totalLength % minNumSplits == 0 ? 0 : 1);

        // now that we have the files, generate the splits
        int splitNum = 0;
        for (final FileStatus file : files) {

            final FileSystem fs = file.getPath().getFileSystem();
            final long len = file.getLen();
            final long blockSize = file.getBlockSize();

            final long minSplitSize;
            if (this.minSplitSize <= blockSize) {
                minSplitSize = this.minSplitSize;
            }
            else {
                if (LOG.isWarnEnabled()) {
                    LOG.warn("Minimal split size of " + this.minSplitSize + " is larger than the block size of " +
                            blockSize + ". Decreasing minimal split size to block size.");
                }
                minSplitSize = blockSize;
            }

            final long splitSize = Math.max(minSplitSize, Math.min(maxSplitSize, blockSize));
            final long halfSplit = splitSize >>> 1;

            final long maxBytesForLastSplit = (long) (splitSize * MAX_SPLIT_SIZE_DISCREPANCY);

            if (len > 0) {

                // get the block locations and make sure they are in order with respect to their offset
                final BlockLocation[] blocks = fs.getFileBlockLocations(file, 0, len);
                Arrays.sort(blocks);

                long bytesUnassigned = len;
                long position = 0;

                int blockIndex = 0;

                while (bytesUnassigned > maxBytesForLastSplit) {
                    // get the block containing the majority of the data
                    blockIndex = getBlockIndexForPosition(blocks, position, halfSplit, blockIndex);
                    // create a new split
                    FileInputSplit fis = new FileInputSplit(splitNum++, file.getPath(), position, splitSize,
                            blocks[blockIndex].getHosts());
                    inputSplits.add(fis);

                    // adjust the positions
                    position += splitSize;
                    bytesUnassigned -= splitSize;
                }

                // assign the last split
                if (bytesUnassigned > 0) {
                    blockIndex = getBlockIndexForPosition(blocks, position, halfSplit, blockIndex);
                    final FileInputSplit fis = new FileInputSplit(splitNum++, file.getPath(), position,
                            bytesUnassigned, blocks[blockIndex].getHosts());
                    inputSplits.add(fis);
                }
            } else {
                // special case with a file of zero bytes size
                final BlockLocation[] blocks = fs.getFileBlockLocations(file, 0, 0);
                String[] hosts;
                if (blocks.length > 0) {
                    hosts = blocks[0].getHosts();
                } else {
                    hosts = new String[0];
                }
                final FileInputSplit fis = new FileInputSplit(splitNum++, file.getPath(), 0, 0, hosts);
                inputSplits.add(fis);
            }
        }

        return inputSplits.toArray(new FileInputSplit[inputSplits.size()]);
    }

    /**
     * Enumerates all files in the directory, recursing into subdirectories if enumerateNestedFiles is true.
     * @return the total length of accepted files.
     */
    private long addFilesInDir(Path path, List<FileStatus> files, boolean logExcludedFiles)
            throws IOException {
        final FileSystem fs = path.getFileSystem();

        long length = 0;

        for(FileStatus dir: fs.listStatus(path)) {
            if (dir.isDir()) {
                if (acceptFile(dir) && enumerateNestedFiles) {
                    length += addFilesInDir(dir.getPath(), files, logExcludedFiles);
                } else {
                    if (logExcludedFiles && LOG.isDebugEnabled()) {
                        LOG.debug("Directory "+dir.getPath().toString()+" did not pass the file-filter and is excluded.");
                    }
                }
            }
            else {
                if(acceptFile(dir)) {
                    files.add(dir);
                    length += dir.getLen();
                    testForUnsplittable(dir);
                } else {
                    if (logExcludedFiles && LOG.isDebugEnabled()) {
                        LOG.debug("Directory "+dir.getPath().toString()+" did not pass the file-filter and is excluded.");
                    }
                }
            }
        }
        return length;
    }

    protected boolean testForUnsplittable(FileStatus pathFile) {
        if(getInflaterInputStreamFactory(pathFile.getPath()) != null) {
            unsplittable = true;
            return true;
        }
        return false;
    }

    private InflaterInputStreamFactory<?> getInflaterInputStreamFactory(Path path) {
        String fileExtension = extractFileExtension(path.getName());
        if (fileExtension != null) {
            return getInflaterInputStreamFactory(fileExtension);
        } else {
            return null;
        }

    }

    /**
     * A simple hook to filter files and directories from the input.
     * The method may be overridden. Hadoop's FileInputFormat has a similar mechanism and applies the
     * same filters by default.
     *
     * @param fileStatus The file status to check.
     * @return true, if the given file or directory is accepted
     */
    public boolean acceptFile(FileStatus fileStatus) {
        final String name = fileStatus.getPath().getName();
        return !name.startsWith("_")
                && !name.startsWith(".")
                && !filesFilter.filterPath(fileStatus.getPath());
    }

    /**
     * Retrieves the index of the <tt>BlockLocation</tt> that contains the part of the file described by the given
     * offset.
     *
     * @param blocks The different blocks of the file. Must be ordered by their offset.
     * @param offset The offset of the position in the file.
     * @param startIndex The earliest index to look at.
     * @return The index of the block containing the given position.
     */
    private int getBlockIndexForPosition(BlockLocation[] blocks, long offset, long halfSplitSize, int startIndex) {
        // go over all indexes after the startIndex
        for (int i = startIndex; i < blocks.length; i++) {
            long blockStart = blocks[i].getOffset();
            long blockEnd = blockStart + blocks[i].getLength();

            if (offset >= blockStart && offset < blockEnd) {
                // got the block where the split starts
                // check if the next block contains more than this one does
                if (i < blocks.length - 1 && blockEnd - offset < halfSplitSize) {
                    return i + 1;
                } else {
                    return i;
                }
            }
        }
        throw new IllegalArgumentException("The given offset is not contained in the any block.");
    }

    // --------------------------------------------------------------------------------------------

    /**
     * Opens an input stream to the file defined in the input format.
     * The stream is positioned at the beginning of the given split.
     * <p>
     * The stream is actually opened in an asynchronous thread to make sure any interruptions to the thread
     * working on the input format do not reach the file system.
     */
    @Override
    public void open(FileInputSplit fileSplit) throws IOException {

        this.currentSplit = fileSplit;
        this.splitStart = fileSplit.getStart();
        this.splitLength = fileSplit.getLength();

        if (LOG.isDebugEnabled()) {
            LOG.debug("Opening input split " + fileSplit.getPath() + " [" + this.splitStart + "," + this.splitLength + "]");
        }


        // open the split in an asynchronous thread
        final InputSplitOpenThread isot = new InputSplitOpenThread(fileSplit, this.openTimeout);
        isot.start();

        try {
            this.stream = isot.waitForCompletion();
            this.stream = decorateInputStream(this.stream, fileSplit);
        }
        catch (Throwable t) {
            throw new IOException("Error opening the Input Split " + fileSplit.getPath() +
                    " [" + splitStart + "," + splitLength + "]: " + t.getMessage(), t);
        }

        // get FSDataInputStream
        if (this.splitStart != 0) {
            this.stream.seek(this.splitStart);
        }
    }

    /**
     * This method allows to wrap/decorate the raw {@link FSDataInputStream} for a certain file split, e.g., for decoding.
     * When overriding this method, also consider adapting {@link FileInputFormat#testForUnsplittable} if your
     * stream decoration renders the input file unsplittable. Also consider calling existing superclass implementations.
     *
     * @param inputStream is the input stream to be decorated
     * @param fileSplit   is the file split for which the input stream shall be decorated
     * @return the decorated input stream
     * @throws Throwable if the decoration fails
     * @see org.apache.flink.api.common.io.InputStreamFSInputWrapper
     */
    protected FSDataInputStream decorateInputStream(FSDataInputStream inputStream, FileInputSplit fileSplit) throws Throwable {
        // Wrap stream in an extracting (decompressing) stream if file ends with a known compression file extension.
        InflaterInputStreamFactory<?> inflaterInputStreamFactory = getInflaterInputStreamFactory(fileSplit.getPath());
        if (inflaterInputStreamFactory != null) {
            return new InputStreamFSInputWrapper(inflaterInputStreamFactory.create(stream));
        }

        return inputStream;
    }

    /**
     * Closes the file input stream of the input format.
     */
    @Override
    public void close() throws IOException {
        if (this.stream != null) {
            // close input stream
            this.stream.close();
            stream = null;
        }
    }

    /**
     * Override this method to support multiple paths.
     * Once this method is removed, all FileInputFormats have to support multiple paths.
     *
     * @return True if the FileInputFormat supports multiple paths, false otherwise.
     *
     * @deprecated Will be removed for Flink 2.0.
     */
    @Deprecated
    public boolean supportsMultiPaths() {
        return false;
    }

    public String toString() {
        return getFilePaths() == null || getFilePaths().length == 0 ?
                "File Input (unknown file)" :
                "File Input (" +  Arrays.toString(this.getFilePaths()) + ')';
    }

    // ============================================================================================

    /**
     * Encapsulation of the basic statistics the optimizer obtains about a file. Contained are the size of the file
     * and the average bytes of a single record. The statistics also have a time-stamp that records the modification
     * time of the file and indicates as such for which time the statistics were valid.
     */
    public static class FileBaseStatistics implements BaseStatistics {

        protected final long fileModTime; // timestamp of the last modification

        protected final long fileSize; // size of the file(s) in bytes

        protected final float avgBytesPerRecord; // the average number of bytes for a record

        /**
         * Creates a new statistics object.
         *
         * @param fileModTime
         *        The timestamp of the latest modification of any of the involved files.
         * @param fileSize
         *        The size of the file, in bytes. <code>-1</code>, if unknown.
         * @param avgBytesPerRecord
         *        The average number of byte in a record, or <code>-1.0f</code>, if unknown.
         */
        public FileBaseStatistics(long fileModTime, long fileSize, float avgBytesPerRecord) {
            this.fileModTime = fileModTime;
            this.fileSize = fileSize;
            this.avgBytesPerRecord = avgBytesPerRecord;
        }

        /**
         * Gets the timestamp of the last modification.
         *
         * @return The timestamp of the last modification.
         */
        public long getLastModificationTime() {
            return fileModTime;
        }

        /**
         * Gets the file size.
         *
         * @return The fileSize.
         * @see org.apache.flink.api.common.io.statistics.BaseStatistics#getTotalInputSize()
         */
        @Override
        public long getTotalInputSize() {
            return this.fileSize;
        }

        /**
         * Gets the estimates number of records in the file, computed as the file size divided by the
         * average record width, rounded up.
         *
         * @return The estimated number of records in the file.
         * @see org.apache.flink.api.common.io.statistics.BaseStatistics#getNumberOfRecords()
         */
        @Override
        public long getNumberOfRecords() {
            return (this.fileSize == SIZE_UNKNOWN || this.avgBytesPerRecord == AVG_RECORD_BYTES_UNKNOWN) ?
                    NUM_RECORDS_UNKNOWN : (long) Math.ceil(this.fileSize / this.avgBytesPerRecord);
        }

        /**
         * Gets the estimated average number of bytes per record.
         *
         * @return The average number of bytes per record.
         * @see org.apache.flink.api.common.io.statistics.BaseStatistics#getAverageRecordWidth()
         */
        @Override
        public float getAverageRecordWidth() {
            return this.avgBytesPerRecord;
        }

        @Override
        public String toString() {
            return "size=" + this.fileSize + ", recWidth=" + this.avgBytesPerRecord + ", modAt=" + this.fileModTime;
        }
    }

    // ============================================================================================

    /**
     * Obtains a DataInputStream in an thread that is not interrupted.
     * This is a necessary hack around the problem that the HDFS client is very sensitive to InterruptedExceptions.
     */
    public static class InputSplitOpenThread extends Thread {

        private final FileInputSplit split;

        private final long timeout;

        private volatile FSDataInputStream fdis;

        private volatile Throwable error;

        private volatile boolean aborted;

        public InputSplitOpenThread(FileInputSplit split, long timeout) {
            super("Transient InputSplit Opener");
            setDaemon(true);

            this.split = split;
            this.timeout = timeout;
        }

        @Override
        public void run() {
            try {
                final FileSystem fs = FileSystem.get(this.split.getPath().toUri());
                this.fdis = fs.open(this.split.getPath());

                // check for canceling and close the stream in that case, because no one will obtain it
                if (this.aborted) {
                    final FSDataInputStream f = this.fdis;
                    this.fdis = null;
                    f.close();
                }
            }
            catch (Throwable t) {
                this.error = t;
            }
        }

        public FSDataInputStream waitForCompletion() throws Throwable {
            final long start = System.currentTimeMillis();
            long remaining = this.timeout;

            do {
                try {
                    // wait for the task completion
                    this.join(remaining);
                }
                catch (InterruptedException iex) {
                    // we were canceled, so abort the procedure
                    abortWait();
                    throw iex;
                }
            }
            while (this.error == null && this.fdis == null &&
                    (remaining = this.timeout + start - System.currentTimeMillis()) > 0);

            if (this.error != null) {
                throw this.error;
            }
            if (this.fdis != null) {
                return this.fdis;
            } else {
                // double-check that the stream has not been set by now. we don't know here whether
                // a) the opener thread recognized the canceling and closed the stream
                // b) the flag was set such that the stream did not see it and we have a valid stream
                // In any case, close the stream and throw an exception.
                abortWait();

                final boolean stillAlive = this.isAlive();
                final StringBuilder bld = new StringBuilder(256);
                for (StackTraceElement e : this.getStackTrace()) {
                    bld.append("\tat ").append(e.toString()).append('\n');
                }
                throw new IOException("Input opening request timed out. Opener was " + (stillAlive ? "" : "NOT ") +
                        " alive. Stack of split open thread:\n" + bld.toString());
            }
        }

        /**
         * Double checked procedure setting the abort flag and closing the stream.
         */
        private void abortWait() {
            this.aborted = true;
            final FSDataInputStream inStream = this.fdis;
            this.fdis = null;
            if (inStream != null) {
                try {
                    inStream.close();
                } catch (Throwable t) {}
            }
        }
    }

    // ============================================================================================
    //  Parameterization via configuration
    // ============================================================================================

    // ------------------------------------- Config Keys ------------------------------------------

    /**
     * The config parameter which defines the input file path.
     */
    private static final String FILE_PARAMETER_KEY = "input.file.path";

    /**
     * The config parameter which defines whether input directories are recursively traversed.
     */
    public static final String ENUMERATE_NESTED_FILES_FLAG = "recursive.file.enumeration";
}

  2. OrcMultiRowInputFormat
import com.hansight.MultiFileInputFormat;
import org.apache.flink.annotation.VisibleForTesting;
import org.apache.flink.api.common.io.FileInputFormat;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.typeutils.ResultTypeQueryable;
import org.apache.flink.api.java.typeutils.RowTypeInfo;
import org.apache.flink.core.fs.FileInputSplit;
import org.apache.flink.core.fs.Path;
import org.apache.flink.types.Row;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.common.type.HiveDecimal;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgument;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory;
import org.apache.hadoop.hive.serde2.io.HiveDecimalWritable;
import org.apache.orc.*;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.math.BigDecimal;
import java.sql.Date;
import java.sql.Timestamp;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import static com.hansight.orc.OrcBatchReader.fillRows;

/**
 * InputFormat to read ORC files.
 */
public class OrcMultiRowInputFormat extends MultiFileInputFormat<Row> implements ResultTypeQueryable<Row> {

	private static final Logger LOG = LoggerFactory.getLogger(OrcMultiRowInputFormat.class);
	// the number of rows read in a batch
	private static final int DEFAULT_BATCH_SIZE = 1000;

	// the number of rows to read in a batch
	private int batchSize;
	// the configuration to read with
	private Configuration conf;
	// the schema of the ORC files to read
	private TypeDescription schema;

	// the fields of the ORC schema that the returned Rows are composed of.
	private int[] selectedFields;
	// the type information of the Rows returned by this InputFormat.
	private transient RowTypeInfo rowType;

	// the ORC reader
	private transient RecordReader orcRowsReader;
	// the vectorized row data to be read in a batch
	private transient VectorizedRowBatch rowBatch;
	// the vector of rows that is read in a batch
	private transient Row[] rows;

	// the number of rows in the current batch
	private transient int rowsInBatch;
	// the index of the next row to return
	private transient int nextRow;

	private ArrayList<Predicate> conjunctPredicates = new ArrayList<>();

	/**
	 * Creates an OrcMultiRowInputFormat.
	 *
	 * @param path The path to read ORC files from.
	 * @param schemaString The schema of the ORC files as String.
	 * @param orcConfig The configuration to read the ORC files with.
	 */
	public OrcMultiRowInputFormat(String schemaString, Configuration orcConfig, Path... path) {
		this(TypeDescription.fromString(schemaString), orcConfig, DEFAULT_BATCH_SIZE, path);
	}

	/**
	 * Creates an OrcMultiRowInputFormat.
	 *
	 * @param path The path to read ORC files from.
	 * @param schemaString The schema of the ORC files as String.
	 * @param orcConfig The configuration to read the ORC files with.
	 * @param batchSize The number of Row objects to read in a batch.
	 */
	public OrcMultiRowInputFormat(String schemaString, Configuration orcConfig, int batchSize, Path... path) {
		this(TypeDescription.fromString(schemaString), orcConfig, batchSize, path);
	}

	/**
	 * Creates an OrcMultiRowInputFormat.
	 *
	 * @param paths The path to read ORC files from.
	 * @param orcSchema The schema of the ORC files as ORC TypeDescription.
	 * @param orcConfig The configuration to read the ORC files with.
	 * @param batchSize The number of Row objects to read in a batch.
	 */
	public OrcMultiRowInputFormat(TypeDescription orcSchema, Configuration orcConfig, int batchSize, Path... paths) {
		super(paths);

		// configure OrcRowInputFormat
		this.schema = orcSchema;
		this.rowType = (RowTypeInfo) OrcBatchReader.schemaToTypeInfo(schema);
		this.conf = orcConfig;
		this.batchSize = batchSize;

		// set default selection mask, i.e., all fields.
		this.selectedFields = new int[this.schema.getChildren().size()];
		for (int i = 0; i < selectedFields.length; i++) {
			this.selectedFields[i] = i;
		}
	}

	/**
	 * Adds a filter predicate to reduce the number of rows to be returned by the input format.
	 * Multiple conjunctive predicates can be added by calling this method multiple times.
	 *
	 * <p>Note: Predicates can significantly reduce the amount of data that is read.
	 * However, the OrcRowInputFormat does not guarantee that all returned rows qualify the
	 * predicates. Moreover, predicates are only applied if the referenced field is among the
	 * selected fields.
	 *
	 * @param predicate The filter predicate.
	 */
	public void addPredicate(Predicate predicate) {
		// validate
		validatePredicate(predicate);
		// add predicate
		this.conjunctPredicates.add(predicate);
	}

	private void validatePredicate(Predicate pred) {
		if (pred instanceof ColumnPredicate) {
			// check column name
			String colName = ((ColumnPredicate) pred).columnName;
			if (!this.schema.getFieldNames().contains(colName)) {
				throw new IllegalArgumentException("Predicate cannot be applied. " +
					"Column '" + colName + "' does not exist in ORC schema.");
			}
		} else if (pred instanceof Not) {
			validatePredicate(((Not) pred).child());
		} else if (pred instanceof Or) {
			for (Predicate p : ((Or) pred).children()) {
				validatePredicate(p);
			}
		}
	}

	/**
	 * Selects the fields from the ORC schema that are returned by InputFormat.
	 *
	 * @param selectedFields The indices of the fields of the ORC schema that are returned by the InputFormat.
	 */
	public void selectFields(int... selectedFields) {
		// set field mapping
		this.selectedFields = selectedFields;
		// adapt result type
		this.rowType = RowTypeInfo.projectFields(this.rowType, selectedFields);
	}

	/**
	 * Computes the ORC projection mask of the fields to include from the selected fields.
	 *
	 * @return The ORC projection mask.
	 */
	private boolean[] computeProjectionMask() {
		// mask with all fields of the schema
		boolean[] projectionMask = new boolean[schema.getMaximumId() + 1];
		// for each selected field
		for (int inIdx : selectedFields) {
			// set all nested fields of a selected field to true
			TypeDescription fieldSchema = schema.getChildren().get(inIdx);
			for (int i = fieldSchema.getId(); i <= fieldSchema.getMaximumId(); i++) {
				projectionMask[i] = true;
			}
		}
		return projectionMask;
	}

	@Override
	public void openInputFormat() throws IOException {
		super.openInputFormat();
		// create and initialize the row batch
		this.rows = new Row[batchSize];
		for (int i = 0; i < batchSize; i++) {
			rows[i] = new Row(selectedFields.length);
		}
	}

	@Override
	public void open(FileInputSplit fileSplit) throws IOException {

		LOG.debug("Opening ORC file {}", fileSplit.getPath());

		// open ORC file and create reader
		org.apache.hadoop.fs.Path hPath = new org.apache.hadoop.fs.Path(fileSplit.getPath().getPath());
		Reader orcReader = OrcFile.createReader(hPath, OrcFile.readerOptions(conf));

		// get offset and length for the stripes that start in the split
		Tuple2<Long, Long> offsetAndLength = getOffsetAndLengthForSplit(fileSplit, getStripes(orcReader));

		// create ORC row reader configuration
		Reader.Options options = getOptions(orcReader)
			.schema(schema)
			.range(offsetAndLength.f0, offsetAndLength.f1)
			.useZeroCopy(OrcConf.USE_ZEROCOPY.getBoolean(conf))
			.skipCorruptRecords(OrcConf.SKIP_CORRUPT_DATA.getBoolean(conf))
			.tolerateMissingSchema(OrcConf.TOLERATE_MISSING_SCHEMA.getBoolean(conf));

		// configure filters
		if (!conjunctPredicates.isEmpty()) {
			SearchArgument.Builder b = SearchArgumentFactory.newBuilder();
			b = b.startAnd();
			for (Predicate predicate : conjunctPredicates) {
				predicate.add(b);
			}
			b = b.end();
			options.searchArgument(b.build(), new String[]{});
		}

		// configure selected fields
		options.include(computeProjectionMask());

		// create ORC row reader
		this.orcRowsReader = orcReader.rows(options);

		// assign ids
		this.schema.getId();
		// create row batch
		this.rowBatch = schema.createRowBatch(batchSize);
		rowsInBatch = 0;
		nextRow = 0;
	}

	@VisibleForTesting
	Reader.Options getOptions(Reader orcReader) {
		return orcReader.options();
	}

	@VisibleForTesting
	List<StripeInformation> getStripes(Reader orcReader) {
		return orcReader.getStripes();
	}

	private Tuple2<Long, Long> getOffsetAndLengthForSplit(FileInputSplit split, List<StripeInformation> stripes) {
		long splitStart = split.getStart();
		long splitEnd = splitStart + split.getLength();

		long readStart = Long.MAX_VALUE;
		long readEnd = Long.MIN_VALUE;

		for (StripeInformation s : stripes) {
			if (splitStart <= s.getOffset() && s.getOffset() < splitEnd) {
				// stripe starts in split, so it is included
				readStart = Math.min(readStart, s.getOffset());
				readEnd = Math.max(readEnd, s.getOffset() + s.getLength());
			}
		}

		if (readStart < Long.MAX_VALUE) {
			// at least one split is included
			return Tuple2.of(readStart, readEnd - readStart);
		} else {
			return Tuple2.of(0L, 0L);
		}
	}

	@Override
	public void close() throws IOException {
		if (orcRowsReader != null) {
			this.orcRowsReader.close();
		}
		this.orcRowsReader = null;
	}

	@Override
	public void closeInputFormat() throws IOException {
		this.rows = null;
		this.schema = null;
		this.rowBatch = null;
	}

	@Override
	public boolean reachedEnd() throws IOException {
		return !ensureBatch();
	}

	/**
	 * Checks if there is at least one row left in the batch to return.
	 * If no more row are available, it reads another batch of rows.
	 *
	 * @return Returns true if there is one more row to return, false otherwise.
	 * @throws IOException throw if an exception happens while reading a batch.
	 */
	private boolean ensureBatch() throws IOException {

		if (nextRow >= rowsInBatch) {
			// No more rows available in the Rows array.
			nextRow = 0;
			// Try to read the next batch of rows from the ORC file.
			boolean moreRows = orcRowsReader.nextBatch(rowBatch);

			if (moreRows) {
				// Load the data into the Rows array.
				rowsInBatch = fillRows(rows, schema, rowBatch, selectedFields);
			}
			return moreRows;
		}
		// there is at least one Row left in the Rows array.
		return true;
	}

	@Override
	public Row nextRecord(Row reuse) throws IOException {
		// return the next row
		return rows[this.nextRow++];
	}

	@Override
	public TypeInformation<Row> getProducedType() {
		return rowType;
	}

	// --------------------------------------------------------------------------------------------
	//  Custom serialization methods
	// --------------------------------------------------------------------------------------------

	private void writeObject(ObjectOutputStream out) throws IOException {
		out.writeInt(batchSize);
		this.conf.write(out);
		out.writeUTF(schema.toString());

		out.writeInt(selectedFields.length);
		for (int f : selectedFields) {
			out.writeInt(f);
		}

		out.writeInt(conjunctPredicates.size());
		for (Predicate p : conjunctPredicates) {
			out.writeObject(p);
		}
	}

	@SuppressWarnings("unchecked")
	private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
		batchSize = in.readInt();
		Configuration configuration = new Configuration();
		configuration.readFields(in);

		if (this.conf == null) {
			this.conf = configuration;
		}
		this.schema = TypeDescription.fromString(in.readUTF());

		this.selectedFields = new int[in.readInt()];
		for (int i = 0; i < selectedFields.length; i++) {
			this.selectedFields[i] = in.readInt();
		}

		this.conjunctPredicates = new ArrayList<>();
		int numPreds = in.readInt();
		for (int i = 0; i < numPreds; i++) {
			conjunctPredicates.add((Predicate) in.readObject());
		}
	}

	@Override
	public boolean supportsMultiPaths() {
		return true;
	}

	// --------------------------------------------------------------------------------------------
	//  Getter methods for tests
	// --------------------------------------------------------------------------------------------

	@VisibleForTesting
    Configuration getConfiguration() {
		return conf;
	}

	@VisibleForTesting
	int getBatchSize() {
		return batchSize;
	}

	@VisibleForTesting
	String getSchema() {
		return schema.toString();
	}

	// --------------------------------------------------------------------------------------------
	//  Classes to define predicates
	// --------------------------------------------------------------------------------------------

	/**
	 * A filter predicate that can be evaluated by the OrcRowInputFormat.
	 */
	public abstract static class Predicate implements Serializable {
		protected abstract SearchArgument.Builder add(SearchArgument.Builder builder);
	}

	abstract static class ColumnPredicate extends Predicate {
		final String columnName;
		final PredicateLeaf.Type literalType;

		ColumnPredicate(String columnName, PredicateLeaf.Type literalType) {
			this.columnName = columnName;
			this.literalType = literalType;
		}

		Object castLiteral(Serializable literal) {

			switch (literalType) {
				case LONG:
					if (literal instanceof Byte) {
						return new Long((Byte) literal);
					} else if (literal instanceof Short) {
						return new Long((Short) literal);
					} else if (literal instanceof Integer) {
						return new Long((Integer) literal);
					} else if (literal instanceof Long) {
						return literal;
					} else {
						throw new IllegalArgumentException("A predicate on a LONG column requires an integer " +
							"literal, i.e., Byte, Short, Integer, or Long.");
					}
				case FLOAT:
					if (literal instanceof Float) {
						return new Double((Float) literal);
					} else if (literal instanceof Double) {
						return literal;
					} else if (literal instanceof BigDecimal) {
						return ((BigDecimal) literal).doubleValue();
					} else {
						throw new IllegalArgumentException("A predicate on a FLOAT column requires a floating " +
							"literal, i.e., Float or Double.");
					}
				case STRING:
					if (literal instanceof String) {
						return literal;
					} else {
						throw new IllegalArgumentException("A predicate on a STRING column requires a floating " +
							"literal, i.e., Float or Double.");
					}
				case BOOLEAN:
					if (literal instanceof Boolean) {
						return literal;
					} else {
						throw new IllegalArgumentException("A predicate on a BOOLEAN column requires a Boolean literal.");
					}
				case DATE:
					if (literal instanceof Date) {
						return literal;
					} else {
						throw new IllegalArgumentException("A predicate on a DATE column requires a java.sql.Date literal.");
					}
				case TIMESTAMP:
					if (literal instanceof Timestamp) {
						return literal;
					} else {
						throw new IllegalArgumentException("A predicate on a TIMESTAMP column requires a java.sql.Timestamp literal.");
					}
				case DECIMAL:
					if (literal instanceof BigDecimal) {
						return new HiveDecimalWritable(HiveDecimal.create((BigDecimal) literal));
					} else {
						throw new IllegalArgumentException("A predicate on a DECIMAL column requires a BigDecimal literal.");
					}
				default:
					throw new IllegalArgumentException("Unknown literal type " + literalType);
			}
		}
	}

	abstract static class BinaryPredicate extends ColumnPredicate {
		final Serializable literal;

		BinaryPredicate(String columnName, PredicateLeaf.Type literalType, Serializable literal) {
			super(columnName, literalType);
			this.literal = literal;
		}
	}

	/**
	 * An EQUALS predicate that can be evaluated by the OrcRowInputFormat.
	 */
	public static class Equals extends BinaryPredicate {
		/**
		 * Creates an EQUALS predicate.
		 *
		 * @param columnName The column to check.
		 * @param literalType The type of the literal.
		 * @param literal The literal value to check the column against.
		 */
		public Equals(String columnName, PredicateLeaf.Type literalType, Serializable literal) {
			super(columnName, literalType, literal);
		}

		@Override
		protected SearchArgument.Builder add(SearchArgument.Builder builder) {
			return builder.equals(columnName, literalType, castLiteral(literal));
		}

		@Override
		public String toString() {
			return columnName + " = " + literal;
		}
	}

	/**
	 * An EQUALS predicate that can be evaluated with Null safety by the OrcRowInputFormat.
	 */
	public static class NullSafeEquals extends BinaryPredicate {
		/**
		 * Creates a null-safe EQUALS predicate.
		 *
		 * @param columnName The column to check.
		 * @param literalType The type of the literal.
		 * @param literal The literal value to check the column against.
		 */
		public NullSafeEquals(String columnName, PredicateLeaf.Type literalType, Serializable literal) {
			super(columnName, literalType, literal);
		}

		@Override
		protected SearchArgument.Builder add(SearchArgument.Builder builder) {
			return builder.nullSafeEquals(columnName, literalType, castLiteral(literal));
		}

		@Override
		public String toString() {
			return columnName + " = " + literal;
		}
	}

	/**
	 * A LESS_THAN predicate that can be evaluated by the OrcRowInputFormat.
	 */
	public static class LessThan extends BinaryPredicate {
		/**
		 * Creates a LESS_THAN predicate.
		 *
		 * @param columnName The column to check.
		 * @param literalType The type of the literal.
		 * @param literal The literal value to check the column against.
		 */
		public LessThan(String columnName, PredicateLeaf.Type literalType, Serializable literal) {
			super(columnName, literalType, literal);
		}

		@Override
		protected SearchArgument.Builder add(SearchArgument.Builder builder) {
			return builder.lessThan(columnName, literalType, castLiteral(literal));
		}

		@Override
		public String toString() {
			return columnName + " < " + literal;
		}
	}

	/**
	 * A LESS_THAN_EQUALS predicate that can be evaluated by the OrcRowInputFormat.
	 */
	public static class LessThanEquals extends BinaryPredicate {
		/**
		 * Creates a LESS_THAN_EQUALS predicate.
		 *
		 * @param columnName The column to check.
		 * @param literalType The type of the literal.
		 * @param literal The literal value to check the column against.
		 */
		public LessThanEquals(String columnName, PredicateLeaf.Type literalType, Serializable literal) {
			super(columnName, literalType, literal);
		}

		@Override
		protected SearchArgument.Builder add(SearchArgument.Builder builder) {
			return builder.lessThanEquals(columnName, literalType, castLiteral(literal));
		}

		@Override
		public String toString() {
			return columnName + " <= " + literal;
		}
	}

	/**
	 * An IS_NULL predicate that can be evaluated by the OrcRowInputFormat.
	 */
	public static class IsNull extends ColumnPredicate {
		/**
		 * Creates an IS_NULL predicate.
		 *
		 * @param columnName The column to check for null.
		 * @param literalType The type of the column to check for null.
		 */
		public IsNull(String columnName, PredicateLeaf.Type literalType) {
			super(columnName, literalType);
		}

		@Override
		protected SearchArgument.Builder add(SearchArgument.Builder builder) {
			return builder.isNull(columnName, literalType);
		}

		@Override
		public String toString() {
			return columnName + " IS NULL";
		}
	}

	/**
	 * A BETWEEN predicate that can be evaluated by the OrcRowInputFormat.
	 */
	public static class Between extends ColumnPredicate {
		private Serializable lowerBound;
		private Serializable upperBound;

		/**
		 * Creates a BETWEEN predicate.
		 *
		 * @param columnName The column to check.
		 * @param literalType The type of the literals.
		 * @param lowerBound The literal value of the (inclusive) lower bound to check the column against.
		 * @param upperBound The literal value of the (inclusive) upper bound to check the column against.
		 */
		public Between(String columnName, PredicateLeaf.Type literalType, Serializable lowerBound, Serializable upperBound) {
			super(columnName, literalType);
			this.lowerBound = lowerBound;
			this.upperBound = upperBound;
		}

		@Override
		protected SearchArgument.Builder add(SearchArgument.Builder builder) {
			return builder.between(columnName, literalType, castLiteral(lowerBound), castLiteral(upperBound));
		}

		@Override
		public String toString() {
			return lowerBound + " <= " + columnName + " <= " + upperBound;
		}
	}

	/**
	 * An IN predicate that can be evaluated by the OrcRowInputFormat.
	 */
	public static class In extends ColumnPredicate {
		private Serializable[] literals;

		/**
		 * Creates an IN predicate.
		 *
		 * @param columnName The column to check.
		 * @param literalType The type of the literals.
		 * @param literals The literal values to check the column against.
		 */
		public In(String columnName, PredicateLeaf.Type literalType, Serializable... literals) {
			super(columnName, literalType);
			this.literals = literals;
		}

		@Override
		protected SearchArgument.Builder add(SearchArgument.Builder builder) {
			Object[] castedLiterals = new Object[literals.length];
			for (int i = 0; i < literals.length; i++) {
				castedLiterals[i] = castLiteral(literals[i]);
			}
			return builder.in(columnName, literalType, (Object[]) castedLiterals);
		}

		@Override
		public String toString() {
			return columnName + " IN " + Arrays.toString(literals);
		}
	}

	/**
	 * A NOT predicate to negate a predicate that can be evaluated by the OrcRowInputFormat.
	 */
	public static class Not extends Predicate {
		private final Predicate pred;

		/**
		 * Creates a NOT predicate.
		 *
		 * @param predicate The predicate to negate.
		 */
		public Not(Predicate predicate) {
			this.pred = predicate;
		}

		@Override
		protected SearchArgument.Builder add(SearchArgument.Builder builder) {
			return pred.add(builder.startNot()).end();
		}

		protected Predicate child() {
			return pred;
		}

		@Override
		public String toString() {
			return "NOT(" + pred.toString() + ")";
		}
	}

	/**
	 * An OR predicate that can be evaluated by the OrcRowInputFormat.
	 */
	public static class Or extends Predicate {
		private final Predicate[] preds;

		/**
		 * Creates an OR predicate.
		 *
		 * @param predicates The disjunctive predicates.
		 */
		public Or(Predicate... predicates) {
			this.preds = predicates;
		}

		@Override
		protected SearchArgument.Builder add(SearchArgument.Builder builder) {
			SearchArgument.Builder withOr = builder.startOr();
			for (Predicate p : preds) {
				withOr = p.add(withOr);
			}
			return withOr.end();
		}

		protected Iterable<Predicate> children() {
			return Arrays.asList(preds);
		}

		@Override
		public String toString() {
			return "OR(" + Arrays.toString(preds) + ")";
		}
	}
}
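For reference, a minimal sketch of how these predicate classes can be instantiated and pushed into the reader by hand (the column names id and event_time and their types are illustrative assumptions; when the table source below is used, applyPredicate builds these objects automatically from Table API filters):

import org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf;

// "orcFormat" is assumed to be an already constructed OrcMultiRowInputFormat.
// Predicates added this way are ANDed together and used to skip ORC data that cannot match.
orcFormat.addPredicate(
        new OrcMultiRowInputFormat.Equals("id", PredicateLeaf.Type.LONG, 1L));
orcFormat.addPredicate(
        new OrcMultiRowInputFormat.Not(
                new OrcMultiRowInputFormat.IsNull("event_time", PredicateLeaf.Type.TIMESTAMP)));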
  1. OrcTableSource
import org.apache.flink.annotation.VisibleForTesting;
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.common.typeinfo.SqlTimeTypeInfo;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.typeutils.RowTypeInfo;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.TableSchema;
import org.apache.flink.table.expressions.*;
import org.apache.flink.table.expressions.IsNull;
import org.apache.flink.table.expressions.LessThan;
import org.apache.flink.table.expressions.Not;
import org.apache.flink.table.expressions.Or;
import org.apache.flink.table.sources.*;
import org.apache.flink.table.sources.tsextractors.ExistingField;
import org.apache.flink.table.sources.wmstrategies.BoundedOutOfOrderTimestamps;
import org.apache.flink.types.Row;
import org.apache.flink.util.Preconditions;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf;
import org.apache.orc.TypeDescription;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import com.hansight.orc.OrcMultiRowInputFormat.*;

public class OrcTableSource
	implements StreamTableSource<Row>, ProjectableTableSource<Row>, FilterableTableSource<Row>, DefinedRowtimeAttributes {

	private static final Logger LOG = LoggerFactory.getLogger(OrcTableSource.class);

	private static final int DEFAULT_BATCH_SIZE = 1000;

	// path to read ORC files from
	private Path[] paths;
	// schema of the ORC file
	private final TypeDescription orcSchema;
	// the schema of the Table
	private final TableSchema tableSchema;
	// the configuration to read the file
	private final Configuration orcConfig;
	// the number of rows to read in a batch
	private final int batchSize;
	// flag whether a path is recursively enumerated
	private final boolean recursiveEnumeration;

	// type information of the data returned by the InputFormat
	private final RowTypeInfo typeInfo;
	// list of selected ORC fields to return
	private final int[] selectedFields;
	// list of predicates to apply
	private final Predicate[] predicates;

    private String rowtimeField;

	/**
	 * Creates an OrcTableSource from an ORC TypeDescription.
	 *
	 * @param paths		The paths to read the ORC files from.
	 * @param orcSchema The schema of the ORC files as TypeDescription.
	 * @param orcConfig The configuration to read the ORC files.
	 * @param batchSize The number of Rows to read in a batch, default is 1000.
	 * @param recursiveEnumeration Flag whether the paths should be recursively enumerated or not.
	 * @param rowtimeField The name of the field that is used as the rowtime attribute.
	 */
	private OrcTableSource(TypeDescription orcSchema, Configuration orcConfig, int batchSize, boolean recursiveEnumeration, String rowtimeField, Path... paths) {
		this(orcSchema, orcConfig, batchSize, recursiveEnumeration, null, null, rowtimeField, paths);
	}

	private OrcTableSource(TypeDescription orcSchema, Configuration orcConfig,
							int batchSize, boolean recursiveEnumeration,
							int[] selectedFields, OrcMultiRowInputFormat.Predicate[] predicates, String rowtimeField, Path... paths) {

		Preconditions.checkNotNull(paths, "Path must not be null.");
		Preconditions.checkNotNull(orcSchema, "OrcSchema must not be null.");
		Preconditions.checkNotNull(paths, "Configuration must not be null.");
		Preconditions.checkArgument(batchSize > 0, "Batch size must be larger than null.");
		this.paths = paths;
		this.orcSchema = orcSchema;
		this.orcConfig = orcConfig;
		this.batchSize = batchSize;
		this.recursiveEnumeration = recursiveEnumeration;
		this.selectedFields = selectedFields;
		this.predicates = predicates;
        this.rowtimeField = rowtimeField;

		// determine the type information from the ORC schema
		RowTypeInfo typeInfoFromSchema = (RowTypeInfo) OrcBatchReader.schemaToTypeInfo(this.orcSchema);

		// set return type info
		if (selectedFields == null) {
			this.typeInfo = typeInfoFromSchema;
		} else {
			this.typeInfo = RowTypeInfo.projectFields(typeInfoFromSchema, selectedFields);
		}

		// create a TableSchema that corresponds to the ORC schema
		this.tableSchema = new TableSchema(
			typeInfoFromSchema.getFieldNames(),
			typeInfoFromSchema.getFieldTypes()
		);
	}

	@VisibleForTesting
	protected OrcMultiRowInputFormat buildOrcInputFormat() {
		return new OrcMultiRowInputFormat(orcSchema, orcConfig, batchSize, paths);
	}

	@Override
	public TypeInformation<Row> getReturnType() {
		return typeInfo;
	}

	@Override
	public TableSchema getTableSchema() {
		return this.tableSchema;
	}

	@Override
	public TableSource<Row> projectFields(int[] selectedFields) {
		// create a copy of the OrcTableSource with new selected fields
		return new OrcTableSource(orcSchema, orcConfig, batchSize, recursiveEnumeration, selectedFields, predicates, rowtimeField, paths);
	}

	@Override
	public TableSource<Row> applyPredicate(List<Expression> predicates) {
		ArrayList<OrcMultiRowInputFormat.Predicate> orcPredicates = new ArrayList<>();

		// we do not remove any predicates from the list because ORC does not fully apply predicates
		for (Expression pred : predicates) {
			OrcMultiRowInputFormat.Predicate orcPred = toOrcPredicate(pred);
			if (orcPred != null) {
				LOG.info("Predicate [{}] converted into OrcPredicate [{}] and pushed into OrcTableSource for path {}.", pred, orcPred, paths);
				orcPredicates.add(orcPred);
			} else {
				LOG.info("Predicate [{}] could not be pushed into OrcTableSource for path {}.", pred, paths);
			}
		}

		return new OrcTableSource(orcSchema, orcConfig, batchSize, recursiveEnumeration, selectedFields, orcPredicates.toArray(new Predicate[]{}), rowtimeField, paths);
	}

	@Override
	public boolean isFilterPushedDown() {
		return this.predicates != null;
	}

	@Override
	public String explainSource() {
		return "OrcFile[path=" + paths + ", schema=" + orcSchema + ", filter=" + predicateString()
			+ ", selectedFields=" + Arrays.toString(selectedFields) + "]";
	}

	private String predicateString() {
		if (predicates == null || predicates.length == 0) {
			return "TRUE";
		} else {
			return "AND(" + Arrays.toString(predicates) + ")";
		}
	}

	// Predicate conversion for filter push-down.

	private Predicate toOrcPredicate(Expression pred) {
		if (pred instanceof Or) {
			Predicate c1 = toOrcPredicate(((Or) pred).left());
			Predicate c2 = toOrcPredicate(((Or) pred).right());
			if (c1 == null || c2 == null) {
				return null;
			} else {
				return new OrcMultiRowInputFormat.Or(c1, c2);
			}
		} else if (pred instanceof Not) {
			Predicate c = toOrcPredicate(((Not) pred).child());
			if (c == null) {
				return null;
			} else {
				return new OrcMultiRowInputFormat.Not(c);
			}
		} else if (pred instanceof BinaryComparison) {

			BinaryComparison binComp = (BinaryComparison) pred;

			if (!isValid(binComp)) {
				// not a valid predicate
				LOG.debug("Unsupported predicate [{}] cannot be pushed into OrcTableSource.", pred);
				return null;
			}
			PredicateLeaf.Type litType = getLiteralType(binComp);
			if (litType == null) {
				// unsupported literal type
				LOG.debug("Unsupported predicate [{}] cannot be pushed into OrcTableSource.", pred);
				return null;
			}

			boolean literalOnRight = literalOnRight(binComp);
			String colName = getColumnName(binComp);

			// fetch literal and ensure it is serializable
			Object literalObj = getLiteral(binComp);
			Serializable literal;
			// validate that literal is serializable
			if (literalObj instanceof Serializable) {
				literal = (Serializable) literalObj;
			} else {
				LOG.warn("Encountered a non-serializable literal of type {}. " +
						"Cannot push predicate [{}] into OrcTableSource. " +
						"This is a bug and should be reported.",
						literalObj.getClass().getCanonicalName(), pred);
				return null;
			}

			if (pred instanceof EqualTo) {
				return new OrcMultiRowInputFormat.Equals(colName, litType, literal);
			} else if (pred instanceof NotEqualTo) {
				return new OrcMultiRowInputFormat.Not(
					new OrcMultiRowInputFormat.Equals(colName, litType, literal));
			} else if (pred instanceof GreaterThan) {
				if (literalOnRight) {
					return new OrcMultiRowInputFormat.Not(
						new OrcMultiRowInputFormat.LessThanEquals(colName, litType, literal));
				} else {
					return new OrcMultiRowInputFormat.LessThan(colName, litType, literal);
				}
			} else if (pred instanceof GreaterThanOrEqual) {
				if (literalOnRight) {
					return new OrcMultiRowInputFormat.Not(
						new OrcMultiRowInputFormat.LessThan(colName, litType, literal));
				} else {
					return new OrcMultiRowInputFormat.LessThanEquals(colName, litType, literal);
				}
			} else if (pred instanceof LessThan) {
				if (literalOnRight) {
					return new OrcMultiRowInputFormat.LessThan(colName, litType, literal);
				} else {
					return new OrcMultiRowInputFormat.Not(
						new OrcMultiRowInputFormat.LessThanEquals(colName, litType, literal));
				}
			} else if (pred instanceof LessThanOrEqual) {
				if (literalOnRight) {
					return new OrcMultiRowInputFormat.LessThanEquals(colName, litType, literal);
				} else {
					return new OrcMultiRowInputFormat.Not(
						new OrcMultiRowInputFormat.LessThan(colName, litType, literal));
				}
			} else {
				// unsupported predicate
				LOG.debug("Unsupported predicate [{}] cannot be pushed into OrcTableSource.", pred);
				return null;
			}
		} else if (pred instanceof UnaryExpression) {

			UnaryExpression unary = (UnaryExpression) pred;
			if (!isValid(unary)) {
				// not a valid predicate
				LOG.debug("Unsupported predicate [{}] cannot be pushed into OrcTableSource.", pred);
				return null;
			}
			PredicateLeaf.Type colType = toOrcType(((UnaryExpression) pred).child().resultType());
			if (colType == null) {
				// unsupported type
				LOG.debug("Unsupported predicate [{}] cannot be pushed into OrcTableSource.", pred);
				return null;
			}

			String colName = getColumnName(unary);

			if (pred instanceof IsNull) {
				return new OrcMultiRowInputFormat.IsNull(colName, colType);
			} else if (pred instanceof IsNotNull) {
				return new OrcMultiRowInputFormat.Not(
					new OrcMultiRowInputFormat.IsNull(colName, colType));
			} else {
				// unsupported predicate
				LOG.debug("Unsupported predicate [{}] cannot be pushed into OrcTableSource.", pred);
				return null;
			}
		} else {
			// unsupported predicate
			LOG.debug("Unsupported predicate [{}] cannot be pushed into OrcTableSource.", pred);
			return null;
		}
	}

	private boolean isValid(UnaryExpression unary) {
		return unary.child() instanceof Attribute;
	}

	private boolean isValid(BinaryComparison comp) {
		return (comp.left() instanceof Literal && comp.right() instanceof Attribute) ||
			(comp.left() instanceof Attribute && comp.right() instanceof Literal);
	}

	private boolean literalOnRight(BinaryComparison comp) {
		if (comp.left() instanceof Literal && comp.right() instanceof Attribute) {
			return false;
		} else if (comp.left() instanceof Attribute && comp.right() instanceof Literal) {
			return true;
		} else {
			throw new RuntimeException("Invalid binary comparison.");
		}
	}

	private String getColumnName(UnaryExpression unary) {
		return ((Attribute) unary.child()).name();
	}

	private String getColumnName(BinaryComparison comp) {
		if (literalOnRight(comp)) {
			return ((Attribute) comp.left()).name();
		} else {
			return ((Attribute) comp.right()).name();
		}
	}

	private PredicateLeaf.Type getLiteralType(BinaryComparison comp) {
		if (literalOnRight(comp)) {
			return toOrcType(((Literal) comp.right()).resultType());
		} else {
			return toOrcType(((Literal) comp.left()).resultType());
		}
	}

	private Object getLiteral(BinaryComparison comp) {
		if (literalOnRight(comp)) {
			return ((Literal) comp.right()).value();
		} else {
			return ((Literal) comp.left()).value();
		}
	}

	private PredicateLeaf.Type toOrcType(TypeInformation<?> type) {
		if (type == BasicTypeInfo.BYTE_TYPE_INFO ||
			type == BasicTypeInfo.SHORT_TYPE_INFO ||
			type == BasicTypeInfo.INT_TYPE_INFO ||
			type == BasicTypeInfo.LONG_TYPE_INFO) {
			return PredicateLeaf.Type.LONG;
		} else if (type == BasicTypeInfo.FLOAT_TYPE_INFO ||
			type == BasicTypeInfo.DOUBLE_TYPE_INFO) {
			return PredicateLeaf.Type.FLOAT;
		} else if (type == BasicTypeInfo.BOOLEAN_TYPE_INFO) {
			return PredicateLeaf.Type.BOOLEAN;
		} else if (type == BasicTypeInfo.STRING_TYPE_INFO) {
			return PredicateLeaf.Type.STRING;
		} else if (type == SqlTimeTypeInfo.TIMESTAMP) {
			return PredicateLeaf.Type.TIMESTAMP;
		} else if (type == SqlTimeTypeInfo.DATE) {
			return PredicateLeaf.Type.DATE;
		} else if (type == BasicTypeInfo.BIG_DEC_TYPE_INFO) {
			return PredicateLeaf.Type.DECIMAL;
		} else {
			// unsupported type
			return null;
		}
	}

	// Builder

	public static Builder builder() {
		return new Builder();
	}

    @Override
    public DataStream<Row> getDataStream(StreamExecutionEnvironment streamExecutionEnvironment) {
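        // Build the multi-path ORC input format, apply projection and pushed-down predicates,
        // then create the source. createInput() routes FileInputFormats through
        // ContinuousFileMonitoringFunction, which is why that class is patched as well.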
        OrcMultiRowInputFormat orcIF = buildOrcInputFormat();
        orcIF.setNestedFileEnumeration(recursiveEnumeration);
        if (selectedFields != null) {
            orcIF.selectFields(selectedFields);
        }
        if (predicates != null) {
            for (OrcMultiRowInputFormat.Predicate pred : predicates) {
                orcIF.addPredicate(pred);
            }
        }
        return streamExecutionEnvironment.createInput(orcIF).name(explainSource());
    }

    @Override
    public List<RowtimeAttributeDescriptor> getRowtimeAttributeDescriptors() {
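        // Expose the configured field as the rowtime attribute; the watermark strategy tolerates
        // up to 7200000 ms (2 hours) of out-of-order data.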
        RowtimeAttributeDescriptor rowtimeAttrDescr = new RowtimeAttributeDescriptor(
                rowtimeField,
                new ExistingField(rowtimeField),
                new BoundedOutOfOrderTimestamps(7200000));
        List<RowtimeAttributeDescriptor> listRowtimeAttrDescr = Collections.singletonList(rowtimeAttrDescr);
        return listRowtimeAttrDescr;
    }

    /**
	 * Constructs an {@link OrcTableSource}.
	 */
	public static class Builder {

		private String[] paths;

		private TypeDescription schema;

		private Configuration config;

		private int batchSize = 0;

		private boolean recursive = true;

        private String rowtimeField;

		/**
		 * Sets the path(s) of the ORC file(s). Multiple directories can be given as a
		 * comma-separated list. If a path specifies a directory, it will be recursively enumerated.
		 *
		 * @param path The comma-separated path(s) of the ORC file(s).
		 * @return The builder.
		 */
		public Builder path(String path) {
			Preconditions.checkNotNull(path, "Path must not be null.");
			this.paths = path.split(",");
			return this;
		}

		/**
		 * Sets the name of the field that is used as the rowtime attribute.
		 *
		 * @param rowtimeField The name of the rowtime field.
		 * @return The builder.
		 */
		public Builder setRowtimeField(String rowtimeField) {
			Preconditions.checkNotNull(rowtimeField, "Rowtime field must not be null.");
			this.rowtimeField = rowtimeField;
			return this;
		}

		/**
		 * Sets the path(s) of the ORC file(s). Multiple directories can be given as a
		 * comma-separated list.
		 *
		 * @param path The comma-separated path(s) of the ORC file(s).
		 * @param recursive Flag whether the path(s) should be recursively enumerated or not.
		 * @return The builder.
		 */
		public Builder path(String path, boolean recursive) {
			Preconditions.checkNotNull(path, "Path must not be null.");
			this.paths = path.split(",");
			this.recursive = recursive;
			return this;
		}

		/**
		 * Sets the ORC schema of the files to read as a String.
		 *
		 * @param orcSchema The ORC schema of the files to read as a String.
		 * @return The builder.
		 */
		public Builder forOrcSchema(String orcSchema) {
			Preconditions.checkNotNull(orcSchema, "ORC schema must not be null.");
			this.schema = TypeDescription.fromString(orcSchema);
			return this;
		}

		/**
		 * Sets the ORC schema of the files to read as a {@link TypeDescription}.
		 *
		 * @param orcSchema The ORC schema of the files to read as a {@link TypeDescription}.
		 * @return The builder.
		 */
		public Builder forOrcSchema(TypeDescription orcSchema) {
			Preconditions.checkNotNull(orcSchema, "ORC Schema must not be null.");
			this.schema = orcSchema;
			return this;
		}

		/**
		 * Sets a Hadoop {@link Configuration} for the ORC reader. If no configuration is configured,
		 * an empty configuration is used.
		 *
		 * @param config The Hadoop Configuration for the ORC reader.
		 * @return The builder.
		 */
		public Builder withConfiguration(Configuration config) {
			Preconditions.checkNotNull(config, "Configuration must not be null.");
			this.config = config;
			return this;
		}

		/**
		 * Sets the number of rows that are read in a batch. If not configured, the ORC files are
		 * read with a batch size of 1000.
		 *
		 * @param batchSize The number of rows that are read in a batch.
		 * @return The builder.
		 */
		public Builder withBatchSize(int batchSize) {
			Preconditions.checkArgument(batchSize > 0, "Batch size must be greater than zero.");
			this.batchSize = batchSize;
			return this;
		}

		/**
		 * Builds the OrcTableSource for this builder.
		 *
		 * @return The OrcTableSource for this builder.
		 */
		public OrcTableSource build() {
			Preconditions.checkNotNull(this.paths, "Path must not be null.");
			Preconditions.checkNotNull(this.schema, "ORC schema must not be null.");
			if (this.config == null) {
				this.config = new Configuration();
			}
			if (this.batchSize == 0) {
				// set default batch size
				this.batchSize = DEFAULT_BATCH_SIZE;
			}
			Path[] path = new Path[paths.length];
			for (int i = 0; i < paths.length; i++) {
				path[i] = new Path(paths[i]);
			}
			return new OrcTableSource(this.schema, this.config, this.batchSize, this.recursive, rowtimeField, path);
		}

	}

}
  1. ContinuousFileMonitoringFunction
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.flink.streaming.api.functions.source;

import org.apache.flink.annotation.Internal;
import org.apache.flink.annotation.VisibleForTesting;
import org.apache.flink.api.common.io.FileInputFormat;
import org.apache.flink.api.common.io.FilePathFilter;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.typeutils.base.LongSerializer;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.core.fs.FileInputSplit;
import org.apache.flink.core.fs.FileStatus;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.util.Preconditions;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.*;

/**
 * This is the single (non-parallel) monitoring task which takes a {@link FileInputFormat}
 * and, depending on the {@link FileProcessingMode} and the {@link FilePathFilter}, it is responsible for:
 *
 * <ol>
 *     <li>Monitoring the user-provided paths.</li>
 *     <li>Deciding which files should be further read and processed.</li>
 *     <li>Creating the {@link FileInputSplit splits} corresponding to those files.</li>
 *     <li>Assigning them to downstream tasks for further processing.</li>
 * </ol>
 *
 * <p>The splits to be read are forwarded to the downstream {@link ContinuousFileReaderOperator}
 * which can have parallelism greater than one.
 *
 * <p><b>IMPORTANT NOTE: </b> Splits are forwarded downstream for reading in ascending modification time order,
 * based on the modification time of the files they belong to.
 */
@Internal
public class ContinuousFileMonitoringFunction<OUT>
        extends RichSourceFunction<TimestampedFileInputSplit> implements CheckpointedFunction {

    private static final long serialVersionUID = 1L;

    private static final Logger LOG = LoggerFactory.getLogger(ContinuousFileMonitoringFunction.class);

    /**
     * The minimum interval allowed between consecutive path scans.
     *
     * <p><b>NOTE:</b> Only applicable to the {@code PROCESS_CONTINUOUSLY} mode.
     */
    public static final long MIN_MONITORING_INTERVAL = 1L;

    /** The paths to monitor. */
    //private final String path;

    private String[] paths;

    /** The parallelism of the downstream readers. */
    private final int readerParallelism;

    /** The {@link FileInputFormat} to be read. */
    private final FileInputFormat<OUT> format;

    /** The interval between consecutive path scans. */
    private final long interval;

    /** Which new data to process (see {@link FileProcessingMode}). */
    private final FileProcessingMode watchType;

    /** The maximum file modification time seen so far. */
    private volatile long globalModificationTime = Long.MIN_VALUE;

    private transient Object checkpointLock;

    private volatile boolean isRunning = true;

    private transient ListState<Long> checkpointedState;

    public ContinuousFileMonitoringFunction(
            FileInputFormat<OUT> format,
            FileProcessingMode watchType,
            int readerParallelism,
            long interval) {

        Preconditions.checkArgument(
                watchType == FileProcessingMode.PROCESS_ONCE || interval >= MIN_MONITORING_INTERVAL,
                "The specified monitoring interval (" + interval + " ms) is smaller than the minimum " +
                        "allowed one (" + MIN_MONITORING_INTERVAL + " ms)."
        );

      /*Preconditions.checkArgument(
         format.getFilePaths().length == 1,
         "FileInputFormats with multiple paths are not supported yet.");*/

        this.format = Preconditions.checkNotNull(format, "Unspecified File Input Format.");

        paths = new String[format.getFilePaths().length];
        for (int i = 0; i < format.getFilePaths().length; i++) {
            paths[i] = format.getFilePaths()[i].toString();
        }
        //this.path = Preconditions.checkNotNull(format.getFilePaths()[0].toString(), "Unspecified Path.");

        this.interval = interval;
        this.watchType = watchType;
        this.readerParallelism = Math.max(readerParallelism, 1);
        this.globalModificationTime = Long.MIN_VALUE;
    }

    @VisibleForTesting
    public long getGlobalModificationTime() {
        return this.globalModificationTime;
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {

        Preconditions.checkState(this.checkpointedState == null,
                "The " + getClass().getSimpleName() + " has already been initialized.");

        this.checkpointedState = context.getOperatorStateStore().getListState(
                new ListStateDescriptor<>(
                        "file-monitoring-state",
                        LongSerializer.INSTANCE
                )
        );

        if (context.isRestored()) {
            LOG.info("Restoring state for the {}.", getClass().getSimpleName());

            List<Long> retrievedStates = new ArrayList<>();
            for (Long entry : this.checkpointedState.get()) {
                retrievedStates.add(entry);
            }

            // given that the parallelism of the function is 1, we can only have 1 or 0 retrieved items.
            // the 0 is for the case that we are migrating from a previous Flink version.

            Preconditions.checkArgument(retrievedStates.size() <= 1,
                    getClass().getSimpleName() + " retrieved invalid state.");

            if (retrievedStates.size() == 1 && globalModificationTime != Long.MIN_VALUE) {
                // this is the case where we have both legacy and new state.
                // The two should be mutually exclusive for the operator, thus we throw the exception.

                throw new IllegalArgumentException(
                        "The " + getClass().getSimpleName() + " has already restored from a previous Flink version.");

            } else if (retrievedStates.size() == 1) {
                this.globalModificationTime = retrievedStates.get(0);
                if (LOG.isDebugEnabled()) {
                    LOG.debug("{} retrieved a global mod time of {}.",
                            getClass().getSimpleName(), globalModificationTime);
                }
            }

        } else {
            LOG.info("No state to restore for the {}.", getClass().getSimpleName());
        }
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        format.configure(parameters);

        if (LOG.isDebugEnabled()) {
            LOG.debug("Opened {} (taskIdx= {}) for path: {}",
                    getClass().getSimpleName(), getRuntimeContext().getIndexOfThisSubtask(), paths);
        }
    }

    @Override
    public void run(SourceFunction.SourceContext<TimestampedFileInputSplit> context) throws Exception {
        FileSystem[] fileSystems = new FileSystem[paths.length];
        for (int i = 0; i < paths.length; i++) {
            Path p = new Path(paths[i]);
            fileSystems[i] = FileSystem.get(p.toUri());
            if (!fileSystems[i].exists(p)) {
                throw new FileNotFoundException("The provided file path " + p + " does not exist.");
            }
        }

        checkpointLock = context.getCheckpointLock();
        switch (watchType) {
            case PROCESS_CONTINUOUSLY:
                while (isRunning) {
                    synchronized (checkpointLock) {
                        monitorDirAndForwardSplits(fileSystems, context);
                    }
                    Thread.sleep(interval);
                }

                // here we do not need to set the running to false and the
                // globalModificationTime to Long.MAX_VALUE because to arrive here,
                // either close() or cancel() have already been called, so this
                // is already done.

                break;
            case PROCESS_ONCE:
                synchronized (checkpointLock) {

                    // the following check guarantees that if we restart
                    // after a failure and we managed to have a successful
                    // checkpoint, we will not reprocess the directory.

                    if (globalModificationTime == Long.MIN_VALUE) {
                        monitorDirAndForwardSplits(fileSystems, context);
                        globalModificationTime = Long.MAX_VALUE;
                    }
                    isRunning = false;
                }
                break;
            default:
                isRunning = false;
                throw new RuntimeException("Unknown WatchType" + watchType);
        }
    }

    private void monitorDirAndForwardSplits(FileSystem[] fs,
                                            SourceContext<TimestampedFileInputSplit> context) throws IOException {
        assert (Thread.holdsLock(checkpointLock));

        Map<Path, FileStatus> eligibleFiles = listEligibleFiles(fs, paths);
        Map<Long, List<TimestampedFileInputSplit>> splitsSortedByModTime = getInputSplitsSortedByModTime(eligibleFiles);

        for (Map.Entry<Long, List<TimestampedFileInputSplit>> splits: splitsSortedByModTime.entrySet()) {
            long modificationTime = splits.getKey();
            for (TimestampedFileInputSplit split: splits.getValue()) {
                LOG.info("Forwarding split: " + split);
                context.collect(split);
            }
            // update the global modification time
            globalModificationTime = Math.max(globalModificationTime, modificationTime);
        }
    }

    /**
     * Creates the input splits to be forwarded to the downstream tasks of the
     * {@link ContinuousFileReaderOperator}. Splits are sorted <b>by modification time</b> before
     * being forwarded and only splits belonging to files in the {@code eligibleFiles}
     * list will be processed.
     * @param eligibleFiles The files to process.
     */
    private Map<Long, List<TimestampedFileInputSplit>> getInputSplitsSortedByModTime(
            Map<Path, FileStatus> eligibleFiles) throws IOException {

        Map<Long, List<TimestampedFileInputSplit>> splitsByModTime = new TreeMap<>();
        if (eligibleFiles.isEmpty()) {
            return splitsByModTime;
        }

        for (FileInputSplit split: format.createInputSplits(readerParallelism)) {
            FileStatus fileStatus = eligibleFiles.get(split.getPath());
            if (fileStatus != null) {
                Long modTime = fileStatus.getModificationTime();
                List<TimestampedFileInputSplit> splitsToForward = splitsByModTime.get(modTime);
                if (splitsToForward == null) {
                    splitsToForward = new ArrayList<>();
                    splitsByModTime.put(modTime, splitsToForward);
                }
                splitsToForward.add(new TimestampedFileInputSplit(
                        modTime, split.getSplitNumber(), split.getPath(),
                        split.getStart(), split.getLength(), split.getHostnames()));
            }
        }
        return splitsByModTime;
    }

    /**
     * Returns the paths of the files not yet processed.
     * @param fileSystems The filesystems where the monitored directories reside.
     * @param paths The monitored directories.
     */
    private Map<Path, FileStatus> listEligibleFiles(FileSystem[] fileSystems, String[] paths) throws IOException {
        // handle the new files
        Map<Path, FileStatus> files = new HashMap<>();
        try {
            for (int i = 0; i < fileSystems.length; i++) {
                FileSystem fileSystem = fileSystems[i];
                Path path = new Path(paths[i]);
                FileStatus[] statuses = fileSystem.listStatus(path);

                if (statuses == null) {
                    LOG.warn("Path does not exist: {}", path);
                    return Collections.emptyMap();
                } else {
                    for (FileStatus status : statuses) {
                        if (!status.isDir()) {
                            Path filePath = status.getPath();
                            long modificationTime = status.getModificationTime();
                            if (!shouldIgnore(filePath, modificationTime)) {
                                files.put(filePath, status);
                            }
                        } else if (format.getNestedFileEnumeration() && format.acceptFile(status)) {
                            FileSystem[] fs = new FileSystem[]{fileSystem};
                            String[] p = new String[]{status.getPath().toString()};
                            files.putAll(listEligibleFiles(fs, p));
                        }
                    }
                }
            }
            return files;
        } catch (IOException e) {
            // we may run into an IOException if files are moved while listing their status
            // delay the check for eligible files in this case
            return Collections.emptyMap();
        }
    }

    /**
     * Returns {@code true} if the file is NOT to be processed further.
     * This happens if the modification time of the file is smaller than
     * the {@link #globalModificationTime}.
     * @param filePath the path of the file to check.
     * @param modificationTime the modification time of the file.
     */
    private boolean shouldIgnore(Path filePath, long modificationTime) {
        assert (Thread.holdsLock(checkpointLock));
        boolean shouldIgnore = modificationTime <= globalModificationTime;
        if (shouldIgnore && LOG.isDebugEnabled()) {
            LOG.debug("Ignoring " + filePath + ", with mod time= " + modificationTime +
                    " and global mod time= " + globalModificationTime);
        }
        return shouldIgnore;
    }

    @Override
    public void close() throws Exception {
        super.close();

        if (checkpointLock != null) {
            synchronized (checkpointLock) {
                globalModificationTime = Long.MAX_VALUE;
                isRunning = false;
            }
        }

        if (LOG.isDebugEnabled()) {
            LOG.debug("Closed File Monitoring Source for path: " + paths + ".");
        }
    }

    @Override
    public void cancel() {
        if (checkpointLock != null) {
            // this is to cover the case where cancel() is called before the run()
            synchronized (checkpointLock) {
                globalModificationTime = Long.MAX_VALUE;
                isRunning = false;
            }
        } else {
            globalModificationTime = Long.MAX_VALUE;
            isRunning = false;
        }
    }

    //   ---------------------         Checkpointing        --------------------------

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        Preconditions.checkState(this.checkpointedState != null,
                "The " + getClass().getSimpleName() + " state has not been properly initialized.");

        this.checkpointedState.clear();
        this.checkpointedState.add(this.globalModificationTime);

        if (LOG.isDebugEnabled()) {
            LOG.debug("{} checkpointed {}.", getClass().getSimpleName(), globalModificationTime);
        }
    }
}

Usage

OrcTableSource tableSrc = OrcTableSource.builder()
                .path("/path/to/data1,/path/to/data2")
                .setRowtimeField("row_time")
                .forOrcSchema(schema)
                .build();
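A minimal sketch of wiring the built source into the Table API follows (assuming Flink 1.9 with the legacy planner; the table name orc_events and the LONG column id used in the filter are illustrative assumptions). Simple comparison filters like the one below are handed to applyPredicate and pushed down into the ORC reader to skip non-matching data, while Flink still re-applies the filter afterwards, as noted in applyPredicate:

import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Event time is needed so the declared rowtime attribute ("row_time") can drive watermarks.
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

// Register the multi-directory ORC source under an arbitrary table name.
tEnv.registerTableSource("orc_events", tableSrc);

// The WHERE clause (on an assumed LONG column "id") is converted by applyPredicate
// into an OrcMultiRowInputFormat predicate and pushed into the reader.
Table result = tEnv.sqlQuery("SELECT * FROM orc_events WHERE id = 1");

tEnv.toAppendStream(result, Row.class).print();
env.execute("orc-multi-path-demo");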