Importing Data from Oracle into HBase with MapReduce

MapReduce is a programming model for data processing built around two core concepts: map and reduce.

MapReduce job workflow

When the program runs, the job is submitted via the JobClient and assigned a unique job ID; the job resources are then copied to the Hadoop file system (HDFS), the cluster initializes the job, and its tasks are distributed to the nodes for execution.
(Figure: MapReduce execution flow)

Code implementation

Note: all of the code below is written in a single class; otherwise the job will not run.

The main method

Let's start with an overview of the program:

public static void main(String[] args) throws Exception {
        //Get the HBase configuration (see the notes below)
        Configuration conf = HBaseConfiguration.create();
        //Load the database connection settings
        InputStream in = OracleToHbaseSale.class.getResourceAsStream("../conf/db.properties");
        Properties properties = new Properties();
        properties.load(in);
        String driverClass = properties.getProperty("driver.class");
        String url = properties.getProperty("driver.url");
        String userName = properties.getProperty("driver.username");
        String passWord = properties.getProperty("driver.password");
        //Oracle JDBC driver settings
        DBConfiguration.configureDB(conf, driverClass,
                url, userName, passWord);
        //Create the job from the prepared configuration
        Job job = Job.getInstance(conf);
        //This must be set to the current class so the job jar can be located
        job.setJarByClass(OracleToHbaseSale.class);
        //Input format
        job.setInputFormatClass(DBInputFormat.class);
        //Columns to read from Oracle
        String[] fields = { "id", "ypbm", "uploadtime", "scph", "corpid", "xssl", "dw", "ghdw", "corpid_zw", "ghdw_zw" };
        DBInputFormat.setInput(job, XsTable.class, "LTJG_MEDI_SALE_FOR_HBASE", null, null, fields);
        //Mapper class
        job.setMapperClass(DBInputMapper.class);
        //Map output key type
        job.setMapOutputKeyClass(Text.class);
        //Map output value type
        job.setMapOutputValueClass(Text.class);
        //Reducer setup: write to the HBase table h_medi_sale
        TableMapReduceUtil.initTableReducerJob("h_medi_sale", HBaseToHBaseReducer.class, job);

        boolean success = job.waitForCompletion(true);

        System.exit(success ? 0 : 1);
    }


  1. Obtain the HBase configuration in preparation for submitting the job.
  2. Read the database connection settings and call DBConfiguration.configureDB, the prerequisite for MapReduce to talk to a relational database (Oracle here).
  3. Set the InputFormat, Mapper, and Reducer.

Notes:
1. Configuration carries the cluster settings. When debugging in Eclipse, put hbase-site.xml under the resources directory on the classpath, or set the cluster properties directly in code (a sketch follows below); when running on one of the Hadoop nodes, HBaseConfiguration.create() alone is enough.
2. If DBInputFormat's setInput does not fit your scenario, you can subclass DBInputFormat and, using its source code as a reference, override setInput to match the business requirements.
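
A minimal sketch of the in-code alternative from note 1, assuming a ZooKeeper quorum of zk1,zk2,zk3 on the default port (both values are placeholders for your own cluster):

//Point the HBase client at the cluster without shipping hbase-site.xml.
//"zk1,zk2,zk3" and "2181" are placeholder values for this sketch.
Configuration conf = HBaseConfiguration.create();
conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3");
conf.set("hbase.zookeeper.property.clientPort", "2181");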

InputFormat

Because the input comes from a relational database, DBInputFormat is used here.
DBInputFormat's setInput method specifies the table and columns to read. I use the field-list form shown below; there is also another form that takes a SQL query directly (see the official documentation; a sketch of the query form follows the next paragraph).

DBInputFormat.setInput

The job, tableName, conditions, orderBy, and fieldNames arguments (the job configuration, the table to query, the query conditions, the ORDER BY column, and the columns to select) are all supplied at call time; what you still need to write is inputClass, a class implementing DBWritable.
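
For reference, a sketch of the query-based overload mentioned above; the two SQL statements are written out here only as an illustration (DBInputFormat needs the row query plus a count query so it can split the input):

//Alternative: feed DBInputFormat a SQL query and a count query directly.
//Both statements below are illustrative; adjust them to your own needs.
DBInputFormat.setInput(job, XsTable.class,
        "SELECT id, ypbm, uploadtime, scph, corpid, xssl, dw, ghdw, corpid_zw, ghdw_zw FROM LTJG_MEDI_SALE_FOR_HBASE",
        "SELECT COUNT(1) FROM LTJG_MEDI_SALE_FOR_HBASE");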

  • DBWritable

DBWritable is an interface with two methods, write and readFields. write writes the object's fields into a PreparedStatement; readFields populates the fields from the columns read back from the relational database.

public static class XsTable implements DBWritable {
        private String id;
        private String ypbm;
        private Date uploadtime;
        private String scph;
        private String corpid;
        private Double xssl;
        private String dw;
        private String ghdw;
        private String corpid_zw;
        private String ghdw_zw;

        @Override
        public void readFields(ResultSet result) throws SQLException {
            id = result.getString("id");
            ypbm = result.getString("ypbm");
            uploadtime = result.getDate("uploadtime");
            scph = result.getString("scph");
            corpid = result.getString("corpid");
            xssl = result.getDouble("xssl");
            dw = result.getString("dw");
            ghdw = result.getString("ghdw");
            corpid_zw = result.getString("corpid_zw");
            ghdw_zw = result.getString("ghdw_zw");
        }

        @Override
        public void write(PreparedStatement stmt) throws SQLException {
            stmt.setString(1, id);
            stmt.setString(2, ypbm);
            stmt.setDate(3, uploadtime);
            stmt.setString(4, scph);
            stmt.setString(5, corpid);
            stmt.setDouble(6, xssl);
            stmt.setString(7, dw);
            stmt.setString(8, ghdw);
            stmt.setString(9, corpid_zw);
            stmt.setString(10, ghdw_zw);
        }
    }

Mapper

The map step transforms the input records into intermediate key/value pairs; this intermediate output is stored on the local disk of the node that runs the map task, not in HDFS.

Here we write a Mapper subclass, DBInputMapper.
Looking at the source of the abstract Mapper class, it has four type parameters: KEYIN, VALUEIN, KEYOUT, VALUEOUT.
KEYIN: the input key type
VALUEIN: the input value type
KEYOUT: the output key type
VALUEOUT: the output value type

Override the map method to shape each record into the key/value pair that will later be written to HBase; see the official documentation for details.
Multi-threaded mapping can be achieved by overriding Mapper's run method so that several map calls run concurrently (a related sketch follows).
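
As an aside to the note on multi-threading, Hadoop ships a MultithreadedMapper wrapper that does the run-method trick for you; a minimal sketch of wiring it in (the thread count of 4 is an arbitrary example, and this only helps when map() itself is CPU-bound):

//Run DBInputMapper's map() calls in parallel threads inside one map task.
//Replaces the plain job.setMapperClass(DBInputMapper.class) call in main.
job.setMapperClass(MultithreadedMapper.class);
MultithreadedMapper.setMapperClass(job, DBInputMapper.class);
MultithreadedMapper.setNumberOfThreads(job, 4); //example value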

public static class DBInputMapper extends Mapper<LongWritable, XsTable, Text, Text> {
        @Override
        protected void map(LongWritable key, XsTable value, Mapper<LongWritable, XsTable, Text, Text>.Context context)
                throws IOException, InterruptedException {
            String yuefen = null;
            Integer tian = null;
            String time = null;
            String ypbm = value.ypbm;
            Date uploadtime = value.uploadtime;
            String scph = value.scph;
            String corpid = value.corpid;
            Double sl = value.xssl;
            String dw = value.dw;
            String ghdw = value.ghdw;
            String corpid_zw = value.corpid_zw;
            String ghdw_zw = value.ghdw_zw;
            if (StringUtils.isBlank(ypbm)) {
                ypbm = "isnull";
            }

            if (StringUtils.isBlank(scph)) {
                scph = "isnull";
            }
            if (StringUtils.isBlank(corpid)) {
                corpid = "isnull";
            }
            if (sl == null) {
                sl = 0d;
            }
            if (StringUtils.isBlank(dw)) {
                dw = "isnull";
            }
            if (StringUtils.isBlank(ghdw)) {
                ghdw = "isnull";
            }
            if (StringUtils.isBlank(corpid_zw)) {
                corpid_zw = StringUtils.isBlank(corpid) ? "isnull" : corpid;
            }
            if (StringUtils.isBlank(ghdw_zw)) {
                ghdw_zw = StringUtils.isBlank(ghdw) ? "isnull" : ghdw;
            }

            if (uploadtime != null) {
                DateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd");
                time = dateFormat.format(value.uploadtime).replace("-", "");
                yuefen = time.substring(0, 6);
                tian = Integer.parseInt(time.substring(6, time.length()));

                StringBuffer newRowkey = new StringBuffer();

                //Map output key: ypbm~uploadtime[yyyyMM]~scph~uploadtime[dd]
                newRowkey.append(ypbm).append(SPLIT).append(yuefen).append(SPLIT).append(scph).append(SPLIT)
                        .append(tian);
                Col col = new Col(value.id, ypbm, scph, corpid, Double.toString(sl), dw, ghdw, corpid_zw, ghdw_zw); //map output value
                Text k = new Text(newRowkey.toString());
                Text v = new Text(JSONObject.toJSONString(col));
                //Don't forget to write the pair out through the Mapper context.
                context.write(k, v);
            }

        }
    }
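
The Col value object and the SPLIT/COLSNAME constants used above are not part of the original listing. A minimal sketch of what they might look like, assuming Col is a plain field holder serialized with fastjson; the separator "~" matches the rowkey comment above, while the column family name "f" is an assumption:

    //Assumed constants: rowkey separator and HBase column family name.
    private static final String SPLIT = "~";
    private static final String COLSNAME = "f";

    //Assumed value object carried from map to reduce as a fastjson string.
    public static class Col {
        public String id, ypbm, scph, corpid, xssl, dw, ghdw, corpid_zw, ghdw_zw;

        public Col() {
        }

        public Col(String id, String ypbm, String scph, String corpid, String xssl,
                String dw, String ghdw, String corpid_zw, String ghdw_zw) {
            this.id = id;
            this.ypbm = ypbm;
            this.scph = scph;
            this.corpid = corpid;
            this.xssl = xssl;
            this.dw = dw;
            this.ghdw = ghdw;
            this.corpid_zw = corpid_zw;
            this.ghdw_zw = ghdw_zw;
        }
    }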

TableReducer(Reducer)

HBase's abstract TableReducer class extends Hadoop's Reducer, so the Reducer documentation applies directly.
Reducer takes the same four type parameters as Mapper:
KEYIN: the input key type
VALUEIN: the input value type
KEYOUT: the output key type
VALUEOUT: the output value type

public abstract class TableReducer<KEYIN, VALUEIN, KEYOUT>
extends Reducer<KEYIN, VALUEIN, KEYOUT, Writable> {
}

TableReducer extends Reducer and fixes the output value type to Writable, since HBase stores only bytes and has no other data types.
We override the reduce method, which is inherited from Reducer through TableReducer.

public static class HBaseToHBaseReducer extends TableReducer<Text, Text, ImmutableBytesWritable> {

        @Override
        protected void reduce(Text key, Iterable<Text> value,
                Reducer<Text, Text, ImmutableBytesWritable, Mutation>.Context context) throws IOException,
                InterruptedException {

            String strKey = key.toString();
            //ypbm~uploadtime[yyyyMM]~scph becomes the rowkey
            String rowkey = StringUtils.substringBeforeLast(strKey, SPLIT);
            //uploadtime[dd] becomes the column qualifier
            String colName = StringUtils.substringAfterLast(strKey, SPLIT);
            byte[] bytesRowkey = Bytes.toBytes(rowkey);
            List<Col> list = new ArrayList<Col>();
            Put put = new Put(bytesRowkey);
            Iterator<Text> iterator = value.iterator();
            while (iterator.hasNext()) {
                Col col = JSONObject.parseObject(iterator.next().toString(), Col.class);
                list.add(col);
            }
            put.add(Bytes.toBytes(COLSNAME), Bytes.toBytes(colName), Bytes.toBytes(JSONObject.toJSONString(list)));
            //Don't forget to write the Put out through the TableReducer context.
            context.write(new ImmutableBytesWritable(bytesRowkey), put);
        }
    }
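
One prerequisite worth spelling out: the target table must already exist before the job runs. A sketch of creating it with the HBase 0.96 client API, assuming the column family name "f" from the sketch above (exception handling omitted):

//Create the target table once; the family name must match COLSNAME.
Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);
HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("h_medi_sale"));
desc.addFamily(new HColumnDescriptor("f"));
admin.createTable(desc);
admin.close();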

Running the MapReduce job

Package the program into a jar, upload it to one of the Hadoop node servers (somewhere on the HADOOP_CLASSPATH), and run:

hadoop jar XXXXX.jar

MapReduce overview

A MapReduce job has five stages: input files, the map phase, intermediate files, the reduce phase, and output files.

In this example:
the Oracle relational database corresponds to the input files;
the Mapper corresponds to the map phase;
the Mapper's context.write(key, value) output corresponds to the intermediate files;
the TableReducer (Reducer) corresponds to the reduce phase;
the Reducer's context.write(key, value) output corresponds to the output files, which are written into the HBase table.

MapReduce details

Maven pom dependencies

<dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.3.0</version>
        </dependency>
        <dependency>
            <groupId>commons-configuration</groupId>
            <artifactId>commons-configuration</artifactId>
            <version>1.6</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-auth</artifactId>
            <version>2.3.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.3.0</version>
        </dependency>
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>1.3.2</version>
        </dependency>
        <dependency>
            <groupId>commons-logging</groupId>
            <artifactId>commons-logging</artifactId>
            <version>1.1.3</version>
        </dependency>
        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>1.2.17</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>0.96.1-hadoop2</version>
        </dependency>
        <dependency>
            <groupId>com.google.protobuf</groupId>
            <artifactId>protobuf-java</artifactId>
            <version>2.5.0</version>
        </dependency>
        <dependency>
            <groupId>io.netty</groupId>
            <artifactId>netty</artifactId>
            <version>3.6.6.Final</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-common</artifactId>
            <version>0.96.1-hadoop2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-protocol</artifactId>
            <version>0.96.1-hadoop2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.zookeeper</groupId>
            <artifactId>zookeeper</artifactId>
            <version>3.4.5</version>
        </dependency>
        <dependency>
            <groupId>org.cloudera.htrace</groupId>
            <artifactId>htrace-core</artifactId>
            <version>2.01</version>
        </dependency>
        <dependency>
            <groupId>org.codehaus.jackson</groupId>
            <artifactId>jackson-mapper-asl</artifactId>
            <version>1.9.13</version>
        </dependency>
        <dependency>
            <groupId>org.codehaus.jackson</groupId>
            <artifactId>jackson-core-asl</artifactId>
            <version>1.9.13</version>
        </dependency>
        <dependency>
            <groupId>org.codehaus.jackson</groupId>
            <artifactId>jackson-jaxrs</artifactId>
            <version>1.9.13</version>
        </dependency>
        <dependency>
            <groupId>org.codehaus.jackson</groupId>
            <artifactId>jackson-xc</artifactId>
            <version>1.9.13</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>1.6.4</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
            <version>1.6.4</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.46</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.hbase/hbase -->
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase</artifactId>
            <version>0.90.2</version>
        </dependency>
    </dependencies>

Complete source code

GitHub repository: https://github.com/huangliuyu00/bigdata/tree/master/OracleToHbase
