Several Ways to Import Data into HBase

红豆和绿豆

Category: hadoop. Tags: HBase load data

Article link: https://blog.csdn.net/u011955252/article/details/50527678

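1. The traditional approach is to enter data by hand in the HBase shell. Every write goes through the HBase client interface and the full HBase write process.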

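2. Use MapReduce for batch import. The writes still pass through the whole chain of HBase processing (HMaster, HRegionServer, and so on), which adds to the system's resource consumption. For example: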

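import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class BatchImport {

    // The data looks like: 0 20101223122329大叫好
    static class BatchImportMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        SimpleDateFormat dateformat1 = new SimpleDateFormat("yyyyMMddHHmmss");
        Text v2 = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            final String[] splited = value.toString().split("\t");
            try {
                // The first field is a timestamp in milliseconds
                final Date date = new Date(Long.parseLong(splited[0].trim()));
                final String dateFormat = dateformat1.format(date);

                // Prepend the row key "<second field>:<formatted date>" to the original line
                v2.set(splited[1] + ":" + dateFormat + "\t" + value.toString());
                context.write(key, v2);
            } catch (NumberFormatException e) {
                // Count malformed lines instead of failing the job
                final Counter counter = context.getCounter("BatchImport", "ErrorFormat");
                counter.increment(1L);
                System.out.println("Parse error: " + splited[0] + " " + e.getMessage());
            }
        }
    }

    static class BatchImportReducer extends TableReducer<LongWritable, Text, NullWritable> {
        @Override
        protected void reduce(LongWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text text : values) {
                final String[] splited = text.toString().split("\t");

                // splited[0] is the row key built in the mapper
                final Put put = new Put(Bytes.toBytes(splited[0]));
                put.add(Bytes.toBytes("cf"), Bytes.toBytes("date"), Bytes.toBytes(splited[1]));
                // Other fields omitted; call put.add(...) for each of them
                context.write(NullWritable.get(), put);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        final Configuration configuration = new Configuration();
        // ZooKeeper quorum
        configuration.set("hbase.zookeeper.quorum", "master");
        // Name of the target HBase table
        configuration.set(TableOutputFormat.OUTPUT_TABLE, "wlan_log");
        // Raise this value so that HBase does not time out and exit
        configuration.set("dfs.socket.timeout", "180000");

        final Job job = new Job(configuration, "HBaseBatchImport");
        job.setJarByClass(BatchImport.class);

        job.setMapperClass(BatchImportMapper.class);
        job.setReducerClass(BatchImportReducer.class);
        // Set the map output types; the reduce output types are not set
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);

        job.setInputFormatClass(TextInputFormat.class);
        // No output path is set; the output format class is set instead
        job.setOutputFormatClass(TableOutputFormat.class);

        FileInputFormat.setInputPaths(job, "hdfs://222.27.174.66:9000/input2");

        job.waitForCompletion(true);
    }
}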
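The row key logic in the mapper above can be tried without a cluster. The following is a minimal sketch of that one step in plain Java; the class name RowKeySketch, the sample field value 13800000000, and the fixed UTC time zone are assumptions made for illustration (the original code uses the JVM's default time zone).

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical helper that mirrors the mapper's row key construction.
final class RowKeySketch {

    // Builds "<second field>:<yyyyMMddHHmmss>" from a tab-separated line whose
    // first field is an epoch timestamp in milliseconds.
    static String buildRowKey(String line) {
        String[] fields = line.split("\t");
        if (fields.length < 2) {
            throw new IllegalArgumentException("expected at least 2 tab-separated fields");
        }
        long millis = Long.parseLong(fields[0].trim()); // may throw NumberFormatException
        SimpleDateFormat format = new SimpleDateFormat("yyyyMMddHHmmss");
        // Assumption: pin the zone to UTC so the result does not depend on the machine
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fields[1] + ":" + format.format(new Date(millis));
    }

    public static void main(String[] args) {
        // Timestamp 0 is 1970-01-01 00:00:00 UTC
        System.out.println(buildRowKey("0\t13800000000\textra"));
    }
}
```

A line whose first field is not a number raises NumberFormatException, which is the case the mapper counts under the "ErrorFormat" counter.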

3. Bypass the HBase write process: generate HFiles directly in HDFS, then load the HFiles into the corresponding HRegionServers.

You can convert an HDFS file into HFiles from the command line.

First create a table in the HBase shell: create 'datatsv', 'd'

Create a file named inputfile. The fields must be separated by tabs, which is importtsv's default separator:

row1	c1	c2
row2	c1	c2
row3	c1	c2
row4	c1	c2
row5	c1	c2
row6	c1	c2
row7	c1	c2
row8	c1	c2
row9	c1	c2

Upload it to a directory in HDFS:

hadoop fs -put inputfile /user/input/inputfile

hadoop jar hbase-0.94.7.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,d:c1,d:c2 -Dimporttsv.bulk.output=/user/output/outputfile datatsv /user/input/inputfile

This generates the HFiles under the /user/output/outputfile directory.
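The -Dimporttsv.columns option in the command above assigns each TSV field, by position, to either the row key or a family:qualifier column. The sketch below only illustrates that positional mapping in plain Java; it is not ImportTsv's implementation, and the class name TsvColumnMappingSketch and the "rowkey" map key are made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of how a column spec lines up with TSV fields.
final class TsvColumnMappingSketch {

    static final String ROW_KEY_MARKER = "HBASE_ROW_KEY";

    // Returns "rowkey" -> value plus "family:qualifier" -> value, in column order.
    static Map<String, String> mapLine(String columnSpec, String line) {
        String[] columns = columnSpec.split(",");
        String[] fields = line.split("\t", -1); // -1 keeps trailing empty fields
        if (columns.length != fields.length) {
            throw new IllegalArgumentException(
                "expected " + columns.length + " fields but got " + fields.length);
        }
        Map<String, String> cells = new LinkedHashMap<>();
        for (int i = 0; i < columns.length; i++) {
            String name = ROW_KEY_MARKER.equals(columns[i]) ? "rowkey" : columns[i];
            cells.put(name, fields[i]);
        }
        return cells;
    }

    public static void main(String[] args) {
        // First field becomes the row key, the next two go to d:c1 and d:c2
        System.out.println(mapLine("HBASE_ROW_KEY,d:c1,d:c2", "row1\tc1\tc2"));
    }
}
```

A line with the wrong number of fields is rejected here, which is why the input file has to use the separator the import expects.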

Load the HFiles into HBase:

bin/hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /user/output/outputfile datatsv

I. This approach has many advantages:

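1. Writing a huge volume of data into HBase in one go is slow and also ties up a lot of region resources. A more efficient and convenient alternative is "bulk loading", which uses the HFileOutputFormat class that HBase provides.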

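2. It relies on the fact that HBase stores its data in HDFS in a specific file format. The job writes files in that format directly and then moves them to the right location, which is how a huge volume of data gets loaded quickly. Done with MapReduce, this is efficient and convenient, and it neither ties up region resources nor adds load to them.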

II. This approach also has a significant limitation:

1. It is best suited to initial imports, that is, when the table is empty, or when the table holds no data each time a load runs.

III. The following demo walks through the whole process.

1. Generating the HFiles

 
package zl.hbase.mr;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.KeyValueSortReducer;
import org.apache.hadoop.hbase.mapreduce.SimpleTotalOrderPartitioner;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

import zl.hbase.util.ConnectionUtil;

public class HFileGenerator {

    public static class HFileMapper extends
            Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each input line is "rowkey,family,qualifier,value"
            String line = value.toString();
            String[] items = line.split(",", -1);

            ImmutableBytesWritable rowkey = new ImmutableBytesWritable(
                    Bytes.toBytes(items[0]));

            KeyValue kv = new KeyValue(Bytes.toBytes(items[0]),
                    Bytes.toBytes(items[1]), Bytes.toBytes(items[2]),
                    System.currentTimeMillis(), Bytes.toBytes(items[3]));

            context.write(rowkey, kv);
        }
    }

    public static void main(String[] args) throws IOException,
            InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        // dfsArgs[0] is the input path, dfsArgs[1] is the HFile output path
        String[] dfsArgs = new GenericOptionsParser(conf, args)
                .getRemainingArgs();

        Job job = new Job(conf, "HFile bulk load test");
        job.setJarByClass(HFileGenerator.class);

        job.setMapperClass(HFileMapper.class);
        job.setReducerClass(KeyValueSortReducer.class);

        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        // Must match the mapper's output value type (KeyValue, not Text)
        job.setMapOutputValueClass(KeyValue.class);

        job.setPartitionerClass(SimpleTotalOrderPartitioner.class);

        FileInputFormat.addInputPath(job, new Path(dfsArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(dfsArgs[1]));

        // Configures the output format, reducer and partitioning for the target table
        HFileOutputFormat.configureIncrementalLoad(job,
                ConnectionUtil.getTable());
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
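The mapper above splits each line with a limit of -1. A small stand-alone sketch shows why that matters: without the limit, Java drops trailing empty strings, so a line with an empty value would yield only three items and items[3] would fail. The class name CsvCellSketch and the sample line are assumptions for illustration.

```java
// Hypothetical illustration of String.split with and without a negative limit.
final class CsvCellSketch {

    // Parses "rowkey,family,qualifier,value" and keeps an empty trailing value.
    static String[] parse(String line) {
        String[] items = line.split(",", -1);
        if (items.length != 4) {
            throw new IllegalArgumentException("expected 4 fields but got " + items.length);
        }
        return items;
    }

    public static void main(String[] args) {
        String line = "r1,cf,q,"; // the value field is empty
        System.out.println(line.split(",").length);     // trailing empty field is dropped
        System.out.println(line.split(",", -1).length); // trailing empty field is kept
        System.out.println(parse(line)[3].isEmpty());
    }
}
```

Checking the field count, as parse does here, is one way to reject malformed lines before building a KeyValue; the original mapper does not do this.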

Notes on the HFile generation program:

①. For the final output, whether it comes from the map or the reduce phase, the key and value types must be <ImmutableBytesWritable, KeyValue> or <ImmutableBytesWritable, Put>.

②. In the final output, the value type is KeyValue or Put, and the corresponding sorter is KeyValueSortReducer or PutSortReducer respectively.

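③. The MR example uses HFileOutputFormat as its output format, as in job.setOutputFormatClass(HFileOutputFormat.class) (configureIncrementalLoad sets this on the job). HFileOutputFormat is only suited to organizing a single column family into HFiles at a time.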

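④. In the MR example, HFileOutputFormat.configureIncrementalLoad(job, table) configures the job automatically. SimpleTotalOrderPartitioner requires the keys to be totally ordered and then divided among the reducers, so that the minimum-to-maximum key ranges of the reducers never overlap. This is needed because, once loaded into HBase, the keys within a region as a whole are strictly ordered.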

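⑤. In the MR example, the generated HFiles are stored in HDFS, and the subdirectories under the output path are the column families. Loading the HFiles into HBase amounts to moving them into the HBase regions, so afterwards the column family subdirectories no longer contain them.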
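The non-overlapping key ranges described in note ④ can be pictured as a lookup against sorted region start keys. The following is a sketch of that idea only, not HBase's partitioner: the class name RegionPartitionSketch and the start keys "row4" and "row7" are invented, and it compares Java strings, which matches byte order only for ASCII keys.

```java
import java.util.Arrays;

// Hypothetical illustration of range partitioning by region start key.
final class RegionPartitionSketch {

    // Returns the index of the range [startKeys[i], startKeys[i + 1]) that contains rowKey.
    // startKeys must be sorted ascending and start with "" (no lower bound).
    static int regionFor(String[] startKeys, String rowKey) {
        int pos = Arrays.binarySearch(startKeys, rowKey);
        // Exact match: the key opens that range. Otherwise binarySearch returns
        // -(insertionPoint) - 1, and the owning range is insertionPoint - 1.
        return pos >= 0 ? pos : -pos - 2;
    }

    public static void main(String[] args) {
        String[] startKeys = {"", "row4", "row7"}; // three assumed regions
        for (int i = 1; i <= 9; i++) {
            String rowKey = "row" + i;
            System.out.println(rowKey + " -> partition " + regionFor(startKeys, rowKey));
        }
    }
}
```

Because every key lands in exactly one range and each reducer sorts its own keys, the files written by different reducers cover disjoint, ordered key ranges.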

2. Loading the HFiles into HBase

 
package zl.hbase.bulkload;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.util.GenericOptionsParser;

import zl.hbase.util.ConnectionUtil;

public class HFileLoader {

    public static void main(String[] args) throws Exception {
        String[] dfsArgs = new GenericOptionsParser(
                ConnectionUtil.getConfiguration(), args).getRemainingArgs();
        LoadIncrementalHFiles loader = new LoadIncrementalHFiles(
                ConnectionUtil.getConfiguration());
        loader.doBulkLoad(new Path(dfsArgs[0]), ConnectionUtil.getTable());
    }
}

The doBulkLoad method of HBase's LoadIncrementalHFiles loads the generated HFiles into the table. Both programs use the following ConnectionUtil helper:

package zl.hbase.util;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;

public class ConnectionUtil {
    private static final String TABLE_NAME = "yy";

    // Opens a new handle to the target table; returns null if the connection fails
    public static HTable getTable() {
        HTable hTable = null;
        try {
            hTable = new HTable(getConfiguration(), TABLE_NAME);
        } catch (IOException e1) {
            e1.printStackTrace();
        }
        return hTable;
    }

    // Loads hbase-site.xml and hbase-default.xml from the classpath
    public static Configuration getConfiguration() {
        Configuration conf = HBaseConfiguration.create();
        return conf;
    }
}