Template code: reading from one HBase table and writing to another
Requirement: read the data of one HBase table and write it into a second HBase table.
Here we copy the name and age columns of column family f1 in table myuser into column family f1 of table myuser2.
1 pom.xml for the Maven project
<repositories>
<repository>
<id>cloudera</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
</repositories>
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.6.0-mr1-cdh5.14.0</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>1.2.0-cdh5.14.0</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-server</artifactId>
<version>1.2.0-cdh5.14.0</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.0</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
<encoding>UTF-8</encoding>
<!--<verbose>true</verbose>-->
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.2</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
2 TableMapper
org.apache.hadoop.hbase.mapreduce
Class TableMapper<KEYOUT,VALUEOUT>
java.lang.Object
org.apache.hadoop.mapreduce.Mapper<ImmutableBytesWritable,Result,KEYOUT,VALUEOUT>
org.apache.hadoop.hbase.mapreduce.TableMapper<KEYOUT,VALUEOUT>
Type Parameters:
KEYOUT - The type of the key.
VALUEOUT - The type of the value.
When you subclass TableMapper you only specify two type parameters, the types of the output key (K2) and output value (V2). The input key (K1) is fixed as ImmutableBytesWritable, the row key read from HBase as a byte array, and the input value (V1) is fixed as Result, so a single value can carry several columns, matching HBase's column-oriented storage. What the mapper receives is essentially the same as what a scan in the hbase shell returns. Note the difference from an ordinary Mapper, whose map() is called once per key/value pair: a TableMapper's map() is called once per row, with the row key as the key and a Result holding all the cells of that row as the value.
2.1 ImmutableBytesWritable
org.apache.hadoop.hbase.io
Class ImmutableBytesWritable
java.lang.Object
org.apache.hadoop.hbase.io.ImmutableBytesWritable
A byte sequence that is usable as a key or value. Based on BytesWritable only this class is NOT resizable and DOES NOT distinguish between the size of the sequence and the current capacity as BytesWritable does. Hence its comparatively 'immutable'. When creating a new instance of this class, the underlying byte [] is not copied, just referenced. The backing buffer is accessed when we go to serialize.
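The no-copy behavior matters in practice: mutating the backing array also changes what the wrapper exposes. A minimal sketch of this aliasing, using only hbase-client classes (no cluster needed); it also shows the offset/length-aware way to turn the key back into a String:

```java
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;

public class IbwDemo {
    public static void main(String[] args) {
        byte[] backing = Bytes.toBytes("row1");
        // the constructor only stores a reference to backing, no copy is made
        ImmutableBytesWritable key = new ImmutableBytesWritable(backing);
        backing[3] = '2'; // mutate the original array
        // the wrapper sees the change; prints "row2"
        System.out.println(Bytes.toString(key.get(), key.getOffset(), key.getLength()));
    }
}
```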
2.2 Result
org.apache.hadoop.hbase.client
Class Result
java.lang.Object
org.apache.hadoop.hbase.client.Result
Single row result of a Get or Scan query.
This class is NOT THREAD SAFE.
Convenience methods are available that return various Map structures and values directly.
To get a complete mapping of all cells in the Result, which can include multiple families and multiple versions, use getMap().
To get a mapping of each family to its columns (qualifiers and values), including only the latest version of each, use getNoVersionMap(). To get a mapping of qualifiers to latest values for an individual family use getFamilyMap(byte[]).
To get the latest value for a specific family and qualifier use getValue(byte[], byte[]). A Result is backed by an array of Cell objects, each representing an HBase cell defined by the row, family, qualifier, timestamp, and value.
The underlying Cell objects can be accessed through the method listCells(). This will create a List from the internal Cell []. Better is to exploit the fact that a new Result instance is a primed CellScanner; just call advance() and current() to iterate over Cells as you would any CellScanner. Call cellScanner() to reset should you need to iterate the same Result over again (CellScanners are one-shot). If you need to overwrite a Result with another Result instance -- as in the old 'mapred' RecordReader next invocations -- then create an empty Result with the null constructor and in then use copyFrom(Result)
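The two access styles described above can be sketched as follows; here the Result is built locally from KeyValue cells for illustration (in a real job it comes from a Get or Scan, and cells must already be in KeyValue sort order):

```java
import java.util.Arrays;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class ResultDemo {
    public static void main(String[] args) throws Exception {
        byte[] row = Bytes.toBytes("0001");
        byte[] f1 = Bytes.toBytes("f1");
        // cells listed in qualifier order ("age" < "name"), as Result expects
        Cell age  = new KeyValue(row, f1, Bytes.toBytes("age"),  Bytes.toBytes("18"));
        Cell name = new KeyValue(row, f1, Bytes.toBytes("name"), Bytes.toBytes("zhangsan"));
        Result result = Result.create(Arrays.asList(age, name));

        // style 1: latest value for one family/qualifier; prints "zhangsan"
        System.out.println(Bytes.toString(result.getValue(f1, Bytes.toBytes("name"))));

        // style 2: a Result is a primed CellScanner; iterate with advance()/current()
        while (result.advance()) {
            Cell c = result.current();
            System.out.println(Bytes.toString(CellUtil.cloneQualifier(c)));
        }
    }
}
```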
3 TableReducer
org.apache.hadoop.hbase.mapreduce
Class TableReducer<KEYIN,VALUEIN,KEYOUT>
java.lang.Object
org.apache.hadoop.mapreduce.Reducer<KEYIN,VALUEIN,KEYOUT,Mutation>
org.apache.hadoop.hbase.mapreduce.TableReducer<KEYIN,VALUEIN,KEYOUT>
Type Parameters:
KEYIN - The type of the input key.
VALUEIN - The type of the input value.
KEYOUT - The type of the output key.
Extends the basic Reducer class to add the required key and value input/output classes. While the input key and value as well as the output key can be anything handed in from the previous map phase the output value must be either a Put or a Delete instance when using the TableOutputFormat class.
This class is extended by IdentityTableReducer but can also be subclassed to implement similar features or any custom code needed. It has the advantage to enforce the output value to a specific basic type.
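Because the output value must be a Put or a Delete (both are Mutation subclasses), the same pattern also covers deletions. A hypothetical reducer that removes the incoming rows from the target table instead of copying them:

```java
import java.io.IOException;

import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.io.Text;

// hypothetical example: emit one Delete per incoming row key instead of a Put
public class DeleteRowsReducer extends TableReducer<Text, Put, ImmutableBytesWritable> {
    @Override
    protected void reduce(Text key, Iterable<Put> values, Context context)
            throws IOException, InterruptedException {
        byte[] row = key.copyBytes(); // row key bytes as emitted by the mapper
        context.write(new ImmutableBytesWritable(row), new Delete(row));
    }
}
```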
4 Java code for the HBase/MapReduce integration
4.1 HBaseMapper
import java.io.IOException;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

public class HBaseMapper extends TableMapper<Text, Put> {
    /**
     * @param key   the row key
     * @param value all cells of one row, wrapped in a Result
     */
    @Override
    protected void map(ImmutableBytesWritable key, Result value, Context context)
            throws IOException, InterruptedException {
        // get the row key as a String
        String rowkey = Bytes.toString(key.get());
        // build a Put for the same row key
        Put put = new Put(key.get());
        // walk the cells of this row
        for (Cell cell : value.rawCells()) {
            // keep only column family f1
            if ("f1".equals(Bytes.toString(CellUtil.cloneFamily(cell)))) {
                // keep only the name and age columns
                String qualifier = Bytes.toString(CellUtil.cloneQualifier(cell));
                if ("name".equals(qualifier) || "age".equals(qualifier)) {
                    put.add(cell);
                }
            }
        }
        // emit the Put only if it picked up at least one cell ---> keyout
        if (!put.isEmpty()) {
            context.write(new Text(rowkey), put);
        }
    }
}
4.2 HBaseReducer
import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.io.Text;

public class HBaseReducer extends TableReducer<Text, Put, Put> {
    /**
     * @param key    the key emitted by the mapper, i.e. the row key
     * @param values the Put objects the mapper built for that row
     */
    @Override
    protected void reduce(Text key, Iterable<Put> values, Context context)
            throws IOException, InterruptedException {
        // pass each Put straight through; TableOutputFormat ignores the key
        for (Put value : values) {
            context.write(null, value);
        }
    }
}
4.3 HBaseClient
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class HBaseClient {
    public static void main(String[] args) throws Exception {
        // pick up hbase-default.xml / hbase-site.xml from the classpath
        Configuration configuration = HBaseConfiguration.create();
        Job job = Job.getInstance(configuration);
        // set the main class of the job jar
        job.setJarByClass(HBaseClient.class);

        Scan scan = new Scan();
        // setCaching controls how many rows each scanner RPC fetches: a larger
        // value speeds up the scan but uses more memory and makes each RPC longer
        scan.setCaching(500);
        // a full-table MapReduce scan should not pollute the block cache
        scan.setCacheBlocks(false);

        // use TableMapReduceUtil to wire up the mapper:
        // (table name, Scan, mapper class, output key class, output value class, job)
        TableMapReduceUtil.initTableMapperJob(
                TableName.valueOf("myuser"),
                scan,
                HBaseMapper.class,
                Text.class,
                Put.class,
                job);
        // ...and the reducer: (table name, reducer class, job)
        TableMapReduceUtil.initTableReducerJob(
                "myuser2",
                HBaseReducer.class,
                job);

        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}
Note that we need the shade plugin to bundle the HBase dependency jars into the project jar, and then run the job from that jar.
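The build-and-submit step can be sketched as the commands below; the jar name and package prefix are placeholders for your own project:

```shell
# package the job together with its HBase dependencies (shade plugin runs at the package phase)
mvn clean package
# submit the shaded jar to the cluster
hadoop jar target/hbase-mr-demo-1.0-SNAPSHOT.jar cn.example.HBaseClient
```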
5 Further reading
5.1 Common API classes
Java class | Corresponding data model |
---|---|
HBaseConfiguration | HBase configuration |
HBaseAdmin | database (DataBase) |
HTable | table (Table) |
Put | data model for insert/update operations |
Get | data model for single-row reads |
Scan | data model for scan queries |
Result | result of a single-row read |
ResultScanner | result of a scan |
HTableDescriptor | table descriptor |
HColumnDescriptor | column family descriptor |
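The classes above fit together in the standard client read/write flow. A hedged sketch for HBase 1.2, where the ZooKeeper quorum address is a placeholder; note that in the 1.x API the idiomatic entry points are Connection/Table obtained from ConnectionFactory rather than the older HTable/HBaseAdmin constructors:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ClientFlowDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "node01:2181"); // placeholder quorum
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("myuser"))) {
            // write one cell
            Put put = new Put(Bytes.toBytes("0001"));
            put.addColumn(Bytes.toBytes("f1"), Bytes.toBytes("name"), Bytes.toBytes("zhangsan"));
            table.put(put);
            // read it back with a single-row Get
            Result result = table.get(new Get(Bytes.toBytes("0001")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("f1"), Bytes.toBytes("name"))));
            // range query with Scan / ResultScanner
            try (ResultScanner scanner = table.getScanner(new Scan())) {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }
        }
    }
}
```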
5.2 TableMapReduceUtil usage
TableMapReduceUtil.initTableMapperJob(
sourceTable, // input table
scan, // Scan instance to control CF and attribute selection
MyMapper.class, // mapper class
null, // mapper output key
null, // mapper output value
job);
TableMapReduceUtil.initTableReducerJob(
targetTable, // output table
null, // reducer class
job);
5.3 Official reference documentation
http://archive.cloudera.com/cdh5/cdh/5/hbase-1.2.0-cdh5.14.0/book.html#mapreduce