HBase BulkLoading

今天该取什么名字好

Category: hbase. Tags: hbase, hdfs

Article link: https://blog.csdn.net/zzds111/article/details/121730099

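Take Hive as a reference: when a table is given a storage path on HDFS, any files already in that path are loaded into the table automatically.

HBase StoreFiles are also stored on HDFS, so we should be able to import data into HBase in the same way. Unlike Hive, however, the files cannot be put into an arbitrary directory; they have to be placed in the appropriate location.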

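I. Definition

HBase BulkLoading:

HBase stores its data on HDFS in a specific file format (HFile). Bulk loading takes advantage of this: it generates files in that format directly and then moves them into the appropriate location, which makes it a fast way to load very large volumes of data. The files are produced by a MapReduce job, so the approach is efficient and convenient, and it neither ties up region resources nor adds load to them.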

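II. Limitations

1. It is best suited to an initial import, that is, the table is empty, or is empty every time data is loaded.

2. The HBase cluster and the Hadoop cluster must be the same cluster, that is, the HDFS underneath HBase belongs to the cluster that runs the MapReduce job generating the HFiles.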

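III. Implementation

What we want to do now is read a text file on HDFS (the code below uses /file/DIANXIN.csv), turn its data into HFiles, and load them into HBase.

Below is a comment taken from configureIncrementalLoad(). It mentions several things to keep in mind when writing the MapReduce program.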

Configure a MapReduce Job to perform an incremental load into the given table. This

- Inspects the table to configure a total order partitioner

- Uploads the partitions file to the cluster and adds it to the DistributedCache

- Sets the number of reduce tasks to match the current number of regions

- Sets the output key/value class to match HFileOutputFormat2's requirements

- Sets the reducer up to perform the appropriate sorting (either KeyValueSortReducer or PutSortReducer)

The user should be sure to set the map output value class to either KeyValue or Put before running this function.

Notes on this comment:

- The number of reduce tasks is set to match the current number of regions, so a reduce count that we set ourselves has no effect.

- HFileOutputFormat2 is the output format that writes the job's output as HFiles.

- The reducer is set up to do the sorting (KeyValueSortReducer or PutSortReducer). On the driver side all we need is a reducer that sorts, and the sort happens inside each reducer.

- Before calling this function, make sure the map output value class is KeyValue or Put. In other words, it is the value type of the mapper's output, not the key type, that must be KeyValue or Put.
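To make the sorting requirement more concrete, here is a minimal sketch in plain Java (no HBase dependency) of the order in which the cells of one row have to be written: by the unsigned byte order of their qualifiers. The class `CellOrderSketch` and its method names are made up for illustration, and the sketch is a simplification: HBase's real cell comparator also takes the row key, column family, timestamp and cell type into account. It uses the two qualifiers `lg` and `lat` from the job shown later in this post.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

final class CellOrderSketch {

    // Unsigned lexicographic comparison of two byte arrays,
    // the ordering HBase applies to row keys and qualifiers.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // Returns the qualifiers of one row in the order they would be written.
    static List<String> sortQualifiers(List<String> emitted) {
        Comparator<String> byBytes = (x, y) -> compareBytes(
                x.getBytes(StandardCharsets.UTF_8),
                y.getBytes(StandardCharsets.UTF_8));
        TreeSet<String> sorted = new TreeSet<>(byBytes);
        sorted.addAll(emitted);
        return new ArrayList<>(sorted);
    }

    public static void main(String[] args) {
        // The mapper emits "lg" first and "lat" second,
        // but "lat" sorts before "lg" because 'a' < 'g'.
        System.out.println(sortQualifiers(Arrays.asList("lg", "lat"))); // [lat, lg]
    }
}
```

This is why the mapper does not have to emit its cells in any particular order: the ordering is established on the reduce side before the HFile is written.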

Create a pre-split table in HBase to receive the data:

 create 'dianxin_bulk','info',{SPLITS=>['1|','3|','5|','7|','9|','B|','D|']}
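The seven split points in the statement above give the table eight regions, and each region covers a half-open range [startKey, endKey) of row keys. The sketch below, in plain Java with no HBase dependency, shows how a row key falls into one of those eight ranges. The class `RegionLookupSketch`, its method names and the sample row keys are hypothetical and only serve as illustration. Note that `|` (0x7C) sorts after digits, uppercase letters and `_`, so a key such as `1..._...` still sorts before the split point `1|`.

```java
import java.nio.charset.StandardCharsets;

final class RegionLookupSketch {

    // The seven split points from the create statement; they define eight regions.
    static final String[] SPLITS = {"1|", "3|", "5|", "7|", "9|", "B|", "D|"};

    // Unsigned lexicographic byte comparison, the order HBase uses for row keys.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // Index (0 to 7) of the region whose [startKey, endKey) range contains the row key.
    static int regionIndex(String rowKey) {
        byte[] key = rowKey.getBytes(StandardCharsets.UTF_8);
        int index = 0;
        for (String split : SPLITS) {
            if (compareBytes(key, split.getBytes(StandardCharsets.UTF_8)) < 0) {
                break; // the key sorts before this split point
            }
            index++;
        }
        return index;
    }

    public static void main(String[] args) {
        // Hypothetical row keys in the mdn_startTime shape used by the job.
        String[] samples = {"13800000000_20180503", "47BE_20180503", "E0000000_20180503"};
        for (String sample : samples) {
            System.out.println(sample + " -> region " + regionIndex(sample));
        }
    }
}
```

Because the job ends up with one reduce task per region, the same lookup also describes which reducer a row key is sent to.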

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.KeyValueSortReducer;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.mapreduce.SimpleTotalOrderPartitioner;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class BulkLoading {
    // Map output types: the value must be KeyValue or Put; KeyValue is used here.
    // The key corresponds to the HBase row key, hence ImmutableBytesWritable.
    public static class BulkloadMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
        private static final byte[] FAMILY = Bytes.toBytes("info");

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String[] splits = value.toString().split(",");
            // Skip malformed lines that lack the columns read below
            if (splits.length < 6) {
                return;
            }
            String mdn = splits[0];
            String startTime = splits[1];
            // Longitude
            String longitude = splits[4];
            // Latitude
            String latitude = splits[5];

            byte[] rowkey = Bytes.toBytes(mdn + "_" + startTime);

            // Each KeyValue is the equivalent of one cell
            KeyValue lgKV = new KeyValue(rowkey, FAMILY, Bytes.toBytes("lg"), Bytes.toBytes(longitude));
            KeyValue latKV = new KeyValue(rowkey, FAMILY, Bytes.toBytes("lat"), Bytes.toBytes(latitude));

            // Two columns are written, so there are two writes
            ImmutableBytesWritable outKey = new ImmutableBytesWritable(rowkey);
            context.write(outKey, lgKV);
            context.write(outKey, latKV);
        }
    }

    // No computation is needed on the reduce side, so no custom Reducer is written

    public static void main(String[] args) throws Exception {

        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "master:2181,node1:2181,node2:2181");

        Job job = Job.getInstance(conf);
        job.setJobName("BulkLoading");
        job.setJarByClass(BulkLoading.class);

        // Configure the map side
        job.setMapperClass(BulkloadMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(KeyValue.class);

        // Total-order partitioning: each reducer receives one contiguous, non-overlapping range of row keys
        // (configureIncrementalLoad below configures its own total order partitioner from the region boundaries)
        job.setPartitionerClass(SimpleTotalOrderPartitioner.class);

        // Configure the reduce side
        // The sort reducer keeps the data inside each reducer ordered
        job.setReducerClass(KeyValueSortReducer.class);

        // This value does not take effect: the number of reduce tasks ends up equal to the number of regions
        job.setNumReduceTasks(3);

        // Input path
        Path inputPath = new Path("/file/DIANXIN.csv");
        FileInputFormat.addInputPath(job, inputPath);

        // Output path (removed first if it already exists)
        Path outputPath = new Path("/output");
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(outputPath)) {
            fs.delete(outputPath, true);
        }
        FileOutputFormat.setOutputPath(job, outputPath);

        TableName tableName = TableName.valueOf("dianxin_bulk");
        // try-with-resources closes the connection, admin, table and locator when done
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin();
             Table dianxinBulk = conn.getTable(tableName);
             RegionLocator locator = conn.getRegionLocator(tableName)) {

            // Step 1: HFileOutputFormat2 writes the job output as HFiles under the output path
            HFileOutputFormat2.configureIncrementalLoad(job, dianxinBulk, locator);

            boolean flag = job.waitForCompletion(true);
            if (flag) {
                // Step 2: load the HFiles into HBase; they are moved into the table's
                // directories, which is why the files under the output path disappear
                LoadIncrementalHFiles load = new LoadIncrementalHFiles(conf);
                load.doBulkLoad(outputPath, admin, dianxinBulk, locator);
            } else {
                System.out.println("MapReduce job failed");
            }
        }
    }

}
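The parsing done in the mapper can also be looked at in isolation, which makes it easy to check without a cluster. The following is a minimal sketch in plain Java under a few assumptions: the class `RecordParserSketch`, the nested `Record` type and the sample line are hypothetical, and the length check for short lines is a choice made for this sketch. Only the column positions (0, 1, 4, 5) and the `mdn_startTime` row key layout come from the job.

```java
import java.util.Optional;

final class RecordParserSketch {

    // The three values the job derives from one input line.
    static final class Record {
        final String rowKey;
        final String longitude;
        final String latitude;

        Record(String rowKey, String longitude, String latitude) {
            this.rowKey = rowKey;
            this.longitude = longitude;
            this.latitude = latitude;
        }
    }

    // Parses one comma-separated line; returns empty if columns 0 to 5 are not all present.
    static Optional<Record> parse(String line) {
        String[] fields = line.split(",");
        if (fields.length < 6) {
            return Optional.empty();
        }
        String rowKey = fields[0] + "_" + fields[1]; // mdn + "_" + start time
        return Optional.of(new Record(rowKey, fields[4], fields[5]));
    }

    public static void main(String[] args) {
        // Hypothetical input line; columns 2 and 3 are not used by the job.
        String line = "mdn001,20180503190539,x,y,117.288,31.822";
        parse(line).ifPresent(r ->
                System.out.println(r.rowKey + " lg=" + r.longitude + " lat=" + r.latitude));
    }
}
```

Keeping the parsing in a small pure function like this is one possible design; the mapper would then only wrap the result in KeyValue objects.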

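When we created the table we pre-split it into 8 empty regions, and the job ran with 8 reduce tasks. The reduce task count we set above did not take effect, which confirms that the number of reducers is determined by the number of regions.

Looking at the table in the web UI on port 16010, we can see that our data was written in just 8 writes, which is very efficient.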
