Importing Data from Oracle into HBase with MapReduce
MapReduce is a programming model for data processing built around two main concepts: Map and Reduce.
MapReduce Job Flow
When the program is run, it is submitted to the JobClient and assigned a unique job ID; the job resources are then copied to the Hadoop Distributed File System (HDFS), the cluster initializes the job, and its tasks are distributed to the nodes for execution.
Code Implementation
Note: all of the code below is written inside one class; otherwise it will not run as shown.
The main method
First, an overview of the program:
public static void main(String[] args) throws Exception {
    // Get the HBase configuration (see the notes below)
    Configuration conf = HBaseConfiguration.create();
    // Load the database connection settings
    InputStream in = OracleToHbaseSale.class.getResourceAsStream("../conf/db.properties");
    Properties properties = new Properties();
    properties.load(in);
    String driverClass = properties.getProperty("driver.class");
    String url = properties.getProperty("driver.url");
    String userName = properties.getProperty("driver.username");
    String passWord = properties.getProperty("driver.password");
    // Oracle JDBC driver settings
    DBConfiguration.configureDB(conf, driverClass, url, userName, passWord);
    // Create the job from the configuration
    Job job = Job.getInstance(conf);
    // Be sure to set this: the job jar is located via the current class
    job.setJarByClass(OracleToHbaseSale.class);
    // Input format
    job.setInputFormatClass(DBInputFormat.class);
    // Columns to read from Oracle
    String[] fields = { "id", "ypbm", "uploadtime", "scph", "corpid", "xssl", "dw", "ghdw", "corpid_zw", "ghdw_zw" };
    DBInputFormat.setInput(job, XsTable.class, "LTJG_MEDI_SALE_FOR_HBASE", null, null, fields);
    // Mapper class
    job.setMapperClass(DBInputMapper.class);
    // Type of the map output key
    job.setMapOutputKeyClass(Text.class);
    // Type of the map output value
    job.setMapOutputValueClass(Text.class);
    // Reducer setup: write into the HBase table h_medi_sale
    TableMapReduceUtil.initTableReducerJob("h_medi_sale", HBaseToHBaseReducer.class, job);
    boolean success = job.waitForCompletion(true);
    System.exit(success ? 0 : 1);
}
- Get the HBase configuration, in preparation for submitting the job.
- Read the database connection settings and configure DBConfiguration, the prerequisite for MapReduce to talk to the relational database (Oracle here).
- Set the InputFormat, Mapper, and Reducer.
Notes:
1. Configuration loads the cluster settings. When debugging in Eclipse, put the hbase-site.xml configuration file under the resources directory, or set the cluster properties directly in code (see the sketch after these notes); when running on one of the Hadoop nodes, HBaseConfiguration.create() is enough on its own.
2. When DBInputFormat's setInput does not fit your scenario, you can subclass DBInputFormat and, using its source as a reference, override setInput to match your needs.
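For note 1, a minimal sketch of pointing the job at the cluster in code instead of shipping hbase-site.xml; the host names and port below are placeholders to replace with your own ZooKeeper quorum:
Configuration conf = HBaseConfiguration.create();
// Placeholder values: set these to your cluster's ZooKeeper quorum and client port
conf.set("hbase.zookeeper.quorum", "node1,node2,node3");
conf.set("hbase.zookeeper.property.clientPort", "2181");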
InputFormat
Because the input comes from a database, DBInputFormat is used here.
DBInputFormat's setInput method specifies the table and columns to read. I use the form shown below; there is also an overload that takes a SQL query (a sketch of it follows this paragraph), see the official documentation for details.
The job, tableName, conditions, orderBy, and fieldNames arguments (the job configuration, the table to read, the query condition, the sort column, and the columns to read) are all supplied at call time; in addition you must pass an inputClass, a class that implements DBWritable.
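For reference, a rough sketch of the query-based overload mentioned above; the two SQL statements are only illustrative (the count query lets the framework compute input splits):
String inputQuery = "SELECT id, ypbm, uploadtime, scph, corpid, xssl, dw, ghdw, corpid_zw, ghdw_zw FROM LTJG_MEDI_SALE_FOR_HBASE";
String inputCountQuery = "SELECT COUNT(1) FROM LTJG_MEDI_SALE_FOR_HBASE";
DBInputFormat.setInput(job, XsTable.class, inputQuery, inputCountQuery);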
- DBWritable
DBWritable is an interface with two methods, write and readFields. write copies the object's fields into a PreparedStatement; readFields populates the fields from the ResultSet read from the relational database.
public static class XsTable implements DBWritable {
    private String id;
    private String ypbm;
    private Date uploadtime;
    private String scph;
    private String corpid;
    private Double xssl;
    private String dw;
    private String ghdw;
    private String corpid_zw;
    private String ghdw_zw;

    @Override
    public void readFields(ResultSet result) throws SQLException {
        id = result.getString("id");
        ypbm = result.getString("ypbm");
        uploadtime = result.getDate("uploadtime");
        scph = result.getString("scph");
        corpid = result.getString("corpid");
        xssl = result.getDouble("xssl");
        dw = result.getString("dw");
        ghdw = result.getString("ghdw");
        corpid_zw = result.getString("corpid_zw");
        ghdw_zw = result.getString("ghdw_zw");
    }

    @Override
    public void write(PreparedStatement stmt) throws SQLException {
        stmt.setString(1, id);
        stmt.setString(2, ypbm);
        stmt.setDate(3, uploadtime);
        stmt.setString(4, scph);
        stmt.setString(5, corpid);
        stmt.setDouble(6, xssl);
        stmt.setString(7, dw);
        stmt.setString(8, ghdw);
        stmt.setString(9, corpid_zw);
        stmt.setString(10, ghdw_zw);
    }
}
Mapper
The map phase turns the input records into intermediate data; the map output is written to the local disk of the node running the map task (not to HDFS).
Here we write a Mapper subclass, DBInputMapper.
Looking at the source of the abstract Mapper class, it has four type parameters: KEYIN, VALUEIN, KEYOUT, VALUEOUT.
KEYIN: type of the input key
VALUEIN: type of the input value
KEYOUT: type of the output key
VALUEOUT: type of the output value
Override the map method to turn each record into the key/value pair that will later be written to HBase; see the official documentation for details.
Multi-threaded mapping can be achieved by overriding Mapper's run method so that several map calls execute in parallel (see the MultithreadedMapper sketch after the mapper code below).
public static class DBInputMapper extends Mapper<LongWritable, XsTable, Text, Text> {
    @Override
    protected void map(LongWritable key, XsTable value, Mapper<LongWritable, XsTable, Text, Text>.Context context)
            throws IOException, InterruptedException {
        String yuefen = null;
        Integer tian = null;
        String time = null;
        String ypbm = value.ypbm;
        Date uploadtime = value.uploadtime;
        String scph = value.scph;
        String corpid = value.corpid;
        Double sl = value.xssl;
        String dw = value.dw;
        String ghdw = value.ghdw;
        String corpid_zw = value.corpid_zw;
        String ghdw_zw = value.ghdw_zw;
        if (StringUtils.isBlank(ypbm)) {
            ypbm = "isnull";
        }
        if (StringUtils.isBlank(scph)) {
            scph = "isnull";
        }
        if (StringUtils.isBlank(corpid)) {
            corpid = "isnull";
        }
        if (sl == null) {
            sl = 0d;
        }
        if (StringUtils.isBlank(dw)) {
            dw = "isnull";
        }
        if (StringUtils.isBlank(ghdw)) {
            ghdw = "isnull";
        }
        if (StringUtils.isBlank(corpid_zw)) {
            corpid_zw = StringUtils.isBlank(corpid) ? "isnull" : corpid;
        }
        if (StringUtils.isBlank(ghdw_zw)) {
            ghdw_zw = StringUtils.isBlank(ghdw) ? "isnull" : ghdw;
        }
        if (uploadtime != null) {
            DateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd");
            time = dateFormat.format(uploadtime).replace("-", "");
            yuefen = time.substring(0, 6);
            tian = Integer.parseInt(time.substring(6, time.length()));
            StringBuffer newRowkey = new StringBuffer();
            // Map output key: ypbm~uploadtime[yyyyMM]~scph~uploadtime[dd]
            newRowkey.append(ypbm).append(SPLIT).append(yuefen).append(SPLIT).append(scph).append(SPLIT)
                    .append(tian);
            // Map output value: the remaining columns serialized as JSON
            Col col = new Col(value.id, ypbm, scph, corpid, Double.toString(sl), dw, ghdw, corpid_zw, ghdw_zw);
            Text k = new Text(newRowkey.toString());
            Text v = new Text(JSONObject.toJSONString(col));
            // Don't forget to emit the pair through the Mapper context.
            context.write(k, v);
        }
    }
}
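On the multi-threading point above: instead of overriding run yourself, Hadoop also ships MultithreadedMapper (org.apache.hadoop.mapreduce.lib.map), which runs several copies of a delegate mapper inside one map task. A sketch of wiring it up in main, assuming four threads is a sensible number for this job:
// Replaces job.setMapperClass(DBInputMapper.class) in main
job.setMapperClass(MultithreadedMapper.class);
MultithreadedMapper.setMapperClass(job, DBInputMapper.class);
MultithreadedMapper.setNumberOfThreads(job, 4);
This mainly helps when each map call does non-trivial work; measure before enabling it.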
TableReducer(Reducer)
The HBase abstract class TableReducer extends Hadoop's Reducer, so you can read the official Reducer documentation directly.
Reducer has the same four type parameters as Mapper:
KEYIN: type of the input key
VALUEIN: type of the input value
KEYOUT: type of the output key
VALUEOUT: type of the output value
public abstract class TableReducer<KEYIN, VALUEIN, KEYOUT>
        extends Reducer<KEYIN, VALUEIN, KEYOUT, Mutation> {
}
TableReducer extends Reducer and fixes the output value type to Mutation (in practice a Put or Delete). HBase stores nothing but bytes, with no other data types, so every value is converted with Bytes.toBytes before it is written.
We override Reducer's reduce method, which reaches our class through TableReducer.
public static class HBaseToHBaseReducer extends TableReducer<Text, Text, ImmutableBytesWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> value,
            Reducer<Text, Text, ImmutableBytesWritable, Mutation>.Context context) throws IOException,
            InterruptedException {
        String strKey = key.toString();
        // ypbm~uploadtime[yyyyMM]~scph becomes the rowkey
        String rowkey = StringUtils.substringBeforeLast(strKey, SPLIT);
        // uploadtime[dd] becomes the column qualifier
        String colName = StringUtils.substringAfterLast(strKey, SPLIT);
        byte[] bytesRowkey = Bytes.toBytes(rowkey);
        List<Col> list = new ArrayList<Col>();
        Put put = new Put(bytesRowkey);
        Iterator<Text> iterator = value.iterator();
        while (iterator.hasNext()) {
            Col col = JSONObject.parseObject(iterator.next().toString(), Col.class);
            list.add(col);
        }
        put.add(Bytes.toBytes(COLSNAME), Bytes.toBytes(colName), Bytes.toBytes(JSONObject.toJSONString(list)));
        // Don't forget to write the Put through the TableReducer context.
        context.write(new ImmutableBytesWritable(bytesRowkey), put);
    }
}
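One assumption worth calling out: the target table h_medi_sale and the column family referred to by COLSNAME must already exist before the job runs. A rough sketch of creating them with the 0.96-era client API, using "cols" as a stand-in family name since the real COLSNAME value is not shown in this excerpt:
Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);
// "cols" is a placeholder; use the same family name as COLSNAME
HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("h_medi_sale"));
desc.addFamily(new HColumnDescriptor("cols"));
if (!admin.tableExists("h_medi_sale")) {
    admin.createTable(desc);
}
admin.close();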
Running the MapReduce Job
Package the finished program into a jar and upload it to one of the Hadoop node servers, placing it on HADOOP_CLASSPATH.
Then run:
hadoop jar XXXXX.jar
MapReduce Overview
MapReduce proceeds through five stages: Input files, Map phase, Intermediate files, Reduce phase, and Output files.
In this example:
the relational database Oracle corresponds to the Input files;
the Mapper corresponds to the Map phase;
the Mapper's context.write(key, value) output corresponds to the Intermediate files;
the TableReducer (Reducer) corresponds to the Reduce phase;
the Reducer's context.write(key, value) corresponds to the Output files, written into the HBase table.
Maven pom Dependencies
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.3.0</version>
</dependency>
<dependency>
<groupId>commons-configuration</groupId>
<artifactId>commons-configuration</artifactId>
<version>1.6</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-auth</artifactId>
<version>2.3.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.3.0</version>
</dependency>
<dependency>
<groupId>commons-io</groupId>
<artifactId>commons-io</artifactId>
<version>1.3.2</version>
</dependency>
<dependency>
<groupId>commons-logging</groupId>
<artifactId>commons-logging</artifactId>
<version>1.1.3</version>
</dependency>
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.17</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>0.96.1-hadoop2</version>
</dependency>
<dependency>
<groupId>com.google.protobuf</groupId>
<artifactId>protobuf-java</artifactId>
<version>2.5.0</version>
</dependency>
<dependency>
<groupId>io.netty</groupId>
<artifactId>netty</artifactId>
<version>3.6.6.Final</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-common</artifactId>
<version>0.96.1-hadoop2</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-protocol</artifactId>
<version>0.96.1-hadoop2</version>
</dependency>
<dependency>
<groupId>org.apache.zookeeper</groupId>
<artifactId>zookeeper</artifactId>
<version>3.4.5</version>
</dependency>
<dependency>
<groupId>org.cloudera.htrace</groupId>
<artifactId>htrace-core</artifactId>
<version>2.01</version>
</dependency>
<dependency>
<groupId>org.codehaus.jackson</groupId>
<artifactId>jackson-mapper-asl</artifactId>
<version>1.9.13</version>
</dependency>
<dependency>
<groupId>org.codehaus.jackson</groupId>
<artifactId>jackson-core-asl</artifactId>
<version>1.9.13</version>
</dependency>
<dependency>
<groupId>org.codehaus.jackson</groupId>
<artifactId>jackson-jaxrs</artifactId>
<version>1.9.13</version>
</dependency>
<dependency>
<groupId>org.codehaus.jackson</groupId>
<artifactId>jackson-xc</artifactId>
<version>1.9.13</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.6.4</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.6.4</version>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.2.46</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hbase/hbase -->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase</artifactId>
<version>0.90.2</version>
</dependency>
</dependencies>
Complete Source Code
GitHub: https://github.com/huangliuyu00/bigdata/tree/master/OracleToHbase