bulkload批量加载数据到HBase

最新推荐文章于 2024-07-03 20:49:24 发布

青眼酷白龙

最新推荐文章于 2024-07-03 20:49:24 发布

阅读量204

点赞数

分类专栏： HBase 文章标签： hadoop HBase

本文链接：https://blog.csdn.net/qq_36382679/article/details/107324202

版权

HBase 专栏收录该内容

10 篇文章 1 订阅

订阅专栏

bulkload批量加载数据到HBase模板代码

需求：将hdfs上/hbase/input/user.txt的数据文件，转换成HFile格式，然后load到myuser2这张表里面去。

首先通过MapReduce读取hdfs上数据，把数据输出成为Hfile格式文件

1 LoadMapper

public class LoadMapper  extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put>{
    @Override
    protected void map(LongWritable key, Text value, Mapper.Context context) throws IOException, InterruptedException {
        String[] split = value.toString().split("\t");
        //创建put对象 指定rowkey
        Put put = new Put(Bytes.toBytes(split[0]));
        //添加列族，列信息
        put.addColumn("f1".getBytes(),"name".getBytes(),split[1].getBytes());
        put.addColumn("f1".getBytes(),"age".getBytes(),Bytes.toBytes(Integer.parseInt(split[2])));
        //输出
        context.write(new ImmutableBytesWritable(Bytes.toBytes(split[0])),put);
    }
}

2 HBaseLoad

public class HBaseLoad {
    public static void main(String[] args) throws Exception{
        final String INPUT_PATH= "hdfs://node-1:8020/hbase/input";
        final String OUTPUT_PATH= "hdfs://node-1:8020/hbase/output_hfile";

        Configuration conf = HBaseConfiguration.create();
        Connection connection = ConnectionFactory.createConnection(conf);
        Table table = connection.getTable(TableName.valueOf("myuser2"));

        Job job= Job.getInstance(conf);
        job.setJarByClass(HBaseLoad.class);
        job.setMapperClass(LoadMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);

        //指定输出的类为HFileOutputFormat2
        job.setOutputFormatClass(HFileOutputFormat2.class);
        HFileOutputFormat2.configureIncrementalLoad(job,table,connection.getRegionLocator(TableName.valueOf("myuser2")));
        //设定输入输出路径
        FileInputFormat.addInputPath(job,new Path(INPUT_PATH));
        FileOutputFormat.setOutputPath(job,new Path(OUTPUT_PATH));

        boolean b = job.waitForCompletion(true);
        System.exit(b?0 :1);
    }
}

3 加载数据LoadData

public class LoadData {
    public static void main(String[] args) throws Exception {
        Configuration configuration = HBaseConfiguration.create();
        configuration.set("hbase.zookeeper.property.clientPort", "2181");
        configuration.set("hbase.zookeeper.quorum", "node-1,node-2,node-3");

        Connection connection =  ConnectionFactory.createConnection(configuration);
        Admin admin = connection.getAdmin();
        Table table = connection.getTable(TableName.valueOf("myuser2"));
        LoadIncrementalHFiles load = new LoadIncrementalHFiles(configuration);
        load.doBulkLoad(new Path("hdfs://node-1:8020/hbase/output_hfile"), admin,table,connection.getRegionLocator(TableName.valueOf("myuser2")));
    }
}

4 HFileOutputFormat2里面封装了reduce阶段

Using this class as part of a MapReduce job is best done
using {@link #configureIncrementalLoad(Job, Table, RegionLocator)}.

/**
   * Configure a MapReduce Job to perform an incremental load into the given
   * table. This
   * <ul>
   *   <li>Inspects the table to configure a total order partitioner</li>
   *   <li>Uploads the partitions file to the cluster and adds it to the DistributedCache</li>
   *   <li>Sets the number of reduce tasks to match the current number of regions</li>
   *   <li>Sets the output key/value class to match HFileOutputFormat2's requirements</li>
   *   <li>Sets the reducer up to perform the appropriate sorting (either KeyValueSortReducer or
   *     PutSortReducer)</li>
   * </ul>
   * The user should be sure to set the map output value class to either KeyValue or Put before
   * running this function.
   */
  public static void configureIncrementalLoad(Job job, Table table, RegionLocator regionLocator)
      throws IOException {
    configureIncrementalLoad(job, table.getTableDescriptor(), regionLocator);
  }

/**
   * Configure a MapReduce Job to perform an incremental load into the given
   * table. This
   * <ul>
   *   <li>Inspects the table to configure a total order partitioner</li>
   *   <li>Uploads the partitions file to the cluster and adds it to the DistributedCache</li>
   *   <li>Sets the number of reduce tasks to match the current number of regions</li>
   *   <li>Sets the output key/value class to match HFileOutputFormat2's requirements</li>
   *   <li>Sets the reducer up to perform the appropriate sorting (either KeyValueSortReducer or
   *     PutSortReducer)</li>
   * </ul>
   * The user should be sure to set the map output value class to either KeyValue or Put before
   * running this function.
   */
  public static void configureIncrementalLoad(Job job, HTableDescriptor tableDescriptor,
      RegionLocator regionLocator) throws IOException {
    configureIncrementalLoad(job, tableDescriptor, regionLocator, HFileOutputFormat2.class);
  }

static void configureIncrementalLoad(Job job, HTableDescriptor tableDescriptor,
      RegionLocator regionLocator, Class<? extends OutputFormat<?, ?>> cls) throws IOException,
      UnsupportedEncodingException {
    Configuration conf = job.getConfiguration();
    job.setOutputKeyClass(ImmutableBytesWritable.class);
    job.setOutputValueClass(KeyValue.class);
    job.setOutputFormatClass(cls);

    // Based on the configured map output class, set the correct reducer to properly
    // sort the incoming values.
    // TODO it would be nice to pick one or the other of these formats.
    if (KeyValue.class.equals(job.getMapOutputValueClass())) {
      job.setReducerClass(KeyValueSortReducer.class);
    } else if (Put.class.equals(job.getMapOutputValueClass())) {
      job.setReducerClass(PutSortReducer.class);
    } else if (Text.class.equals(job.getMapOutputValueClass())) {
      job.setReducerClass(TextSortReducer.class);
    } else {
      LOG.warn("Unknown map output value type:" + job.getMapOutputValueClass());
    }

    conf.setStrings("io.serializations", conf.get("io.serializations"),
        MutationSerialization.class.getName(), ResultSerialization.class.getName(),
        KeyValueSerialization.class.getName());

    // Use table's region boundaries for TOP split points.
    LOG.info("Looking up current regions for table " + tableDescriptor.getTableName());
    List<ImmutableBytesWritable> startKeys = getRegionStartKeys(regionLocator);
    LOG.info("Configuring " + startKeys.size() + " reduce partitions " +
        "to match current region count");
    job.setNumReduceTasks(startKeys.size());

    configurePartitioner(job, startKeys);
    // Set compression algorithms based on column families
    configureCompression(conf, tableDescriptor);
    configureBloomType(tableDescriptor, conf);
    configureBlockSize(tableDescriptor, conf);
    configureDataBlockEncoding(tableDescriptor, conf);

    TableMapReduceUtil.addDependencyJars(job);
    TableMapReduceUtil.initCredentials(job);
    LOG.info("Incremental table " + regionLocator.getName() + " output configured.");
  }

根据getMapOutPutValueClass的类选择合适的reduce的类进行排序

if (KeyValue.class.equals(job.getMapOutputValueClass())) {
job.setReducerClass(KeyValueSortReducer.class);
} else if (Put.class.equals(job.getMapOutputValueClass())) {
job.setReducerClass(PutSortReducer.class);
} else if (Text.class.equals(job.getMapOutputValueClass())) {
job.setReducerClass(TextSortReducer.class);
} else {
LOG.warn(“Unknown map output value type:” + job.getMapOutputValueClass());
}

青眼酷白龙

关注

0
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
bulkload批量加载数据到HBase

bulkload批量加载数据到HBase模板代码需求：将hdfs上/hbase/input/user.txt的数据文件，转换成HFile格式，然后load到myuser2这张表里面去。首先通过MapReduce读取hdfs上数据，把数据输出成为Hfile格式文件1 LoadMapperpublic class LoadMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put>{ @Override
复制链接

扫一扫