Reading HBase Data Sources with Spark

AlferWei

Category: Spark HBase

Article link: https://blog.csdn.net/OiteBody/article/details/80187848

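When Spark reads HDFS-related data sources, it relies heavily on the data-reading mechanisms packaged by MapReduce, and a MapReduce job depends on InputFormat to validate the input format, split the input, and so on. Reading an HBase data source uses TableInputFormat. Let's look at InputFormat first.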

InputFormat

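InputFormat is the data source format abstraction provided by MapReduce. Through it, a job can read from all kinds of data sources (file systems, databases, and so on) and run MapReduce computations over them.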

Here is the definition of InputFormat (declared as an abstract class):

public abstract class InputFormat<K, V> {

  /** 
   * Logically split the set of input files for the job.  
   * 
   * @param context job configuration.
   * @return an array of {@link InputSplit}s for the job.
   */
  public abstract 
    List<InputSplit> getSplits(JobContext context
                               ) throws IOException, InterruptedException;
  
  /**
   * Create a record reader for a given split. The framework will call
   * {@link RecordReader#initialize(InputSplit, TaskAttemptContext)} before
   * the split is used.
   */
  public abstract 
    RecordReader<K,V> createRecordReader(InputSplit split,
                                         TaskAttemptContext context
                                        ) throws IOException, 
                                                 InterruptedException;

}

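getSplits determines the logical partitioning strategy, and createRecordReader provides an iterator over the records of each partition produced by that split.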
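
To make the division of labor between the two methods concrete, here is a minimal sketch in plain Java. It does not use the Hadoop API: ListInputFormat, RangeSplit, and the fixed splitSize parameter are hypothetical stand-ins introduced only for illustration, and the "data source" is just an in-memory list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class MiniInputFormatDemo {

    /** Hypothetical stand-in for InputSplit: a half-open index range [start, end). */
    static final class RangeSplit {
        final int start;
        final int end;

        RangeSplit(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Hypothetical, simplified counterpart of InputFormat over an in-memory list. */
    static final class ListInputFormat {
        private final List<String> source;
        private final int splitSize;

        ListInputFormat(List<String> source, int splitSize) {
            if (splitSize <= 0) {
                throw new IllegalArgumentException("splitSize must be positive");
            }
            this.source = new ArrayList<>(source);
            this.splitSize = splitSize;
        }

        /** Counterpart of getSplits: decides how the input is cut into logical partitions. */
        List<RangeSplit> getSplits() {
            List<RangeSplit> splits = new ArrayList<>();
            for (int start = 0; start < source.size(); start += splitSize) {
                int end = Math.min(start + splitSize, source.size());
                splits.add(new RangeSplit(start, end));
            }
            return splits;
        }

        /** Counterpart of createRecordReader: iterates over the records of one split. */
        Iterator<String> createRecordReader(RangeSplit split) {
            return source.subList(split.start, split.end).iterator();
        }
    }

    public static void main(String[] args) {
        ListInputFormat format =
            new ListInputFormat(Arrays.asList("a", "b", "c", "d", "e"), 2);

        // The "framework" side: ask for the splits, then read each one independently.
        for (RangeSplit split : format.getSplits()) {
            List<String> records = new ArrayList<>();
            Iterator<String> reader = format.createRecordReader(split);
            while (reader.hasNext()) {
                records.add(reader.next());
            }
            System.out.println("split [" + split.start + ", " + split.end + ") -> " + records);
        }
    }
}
```

Each split can be read without knowing about the others, which is what lets a framework hand different splits to different tasks.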

TableInputFormat

TableInputFormat is the InputFormat that HBase provides. Here is its split strategy:

RegionSizeCalculator sizeCalculator =
    new RegionSizeCalculator(getRegionLocator(), getAdmin());
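
The role of the size calculator in the split strategy can be sketched in plain Java. This is not HBase's code: it assumes the default behavior of one split per region, with the region's size attached to the split as a length estimate. Region, Split, and the regionSizes map (standing in for the lookup that RegionSizeCalculator performs) are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RegionSplitSketch {

    /** Hypothetical stand-in for an HBase region: a name plus its row-key range. */
    static final class Region {
        final String name;
        final String startKey;
        final String endKey;

        Region(String name, String startKey, String endKey) {
            this.name = name;
            this.startKey = startKey;
            this.endKey = endKey;
        }
    }

    /** Hypothetical stand-in for a table split: a key range plus a size estimate in bytes. */
    static final class Split {
        final String startKey;
        final String endKey;
        final long lengthBytes;

        Split(String startKey, String endKey, long lengthBytes) {
            this.startKey = startKey;
            this.endKey = endKey;
            this.lengthBytes = lengthBytes;
        }
    }

    /**
     * Assumed strategy: one split per region. The size comes from a lookup table,
     * which plays the role of the size calculator; unknown regions count as 0 bytes.
     */
    static List<Split> getSplits(List<Region> regions, Map<String, Long> regionSizes) {
        List<Split> splits = new ArrayList<>();
        for (Region region : regions) {
            long size = regionSizes.getOrDefault(region.name, 0L);
            splits.add(new Split(region.startKey, region.endKey, size));
        }
        return splits;
    }

    public static void main(String[] args) {
        // An empty string marks an open-ended boundary (first or last region).
        List<Region> regions = Arrays.asList(
            new Region("region-1", "", "row-500"),
            new Region("region-2", "row-500", ""));

        Map<String, Long> regionSizes = new HashMap<>();
        regionSizes.put("region-1", 64L * 1024 * 1024);
        regionSizes.put("region-2", 16L * 1024 * 1024);

        for (Split split : getSplits(regions, regionSizes)) {
            System.out.println("[" + split.startKey + ", " + split.endKey + ") "
                + split.lengthBytes + " bytes");
        }
    }
}
```

Under this assumption the number of partitions Spark sees follows the number of regions in the table, and the size estimate gives the scheduler a hint about how much data each split covers.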
