Importing Data from Oracle into HBase with MapReduce
MapReduce is a programming model for data processing built around two main concepts: Map and Reduce.
MapReduce Job Flow
When the program is run, it is submitted to the JobClient and assigned a unique job ID; the job resources are then copied to the Hadoop Distributed File System (HDFS), the cluster initializes the job, and its tasks are distributed to the nodes for execution.
Code Implementation
Note: all of the code below is written inside one class; otherwise it will not run as shown.
The main method
First, an overview of the program:
public static void main(String[] args) throws Exception {
    // Get the HBase configuration (see the notes below)
    Configuration conf = HBaseConfiguration.create();
    // Load the database connection settings
    InputStream in = OracleToHbaseSale.class.getResourceAsStream("../conf/db.properties");
    Properties properties = new Properties();
    properties.load(in);
    String driverClass = properties.getProperty("driver.class");
    String url = properties.getProperty("driver.url");
    String userName = properties.getProperty("driver.username");
    String passWord = properties.getProperty("driver.password");
    // Oracle JDBC driver settings
    DBConfiguration.configureDB(conf, driverClass, url, userName, passWord);
    // Create the job from the configuration
    Job job = Job.getInstance(conf);
    // Be sure to set this: the job jar is located via the current class
    job.setJarByClass(OracleToHbaseSale.class);
    // Input format
    job.setInputFormatClass(DBInputFormat.class);
    // Columns to read from Oracle
    String[] fields = { "id", "ypbm", "uploadtime", "scph", "corpid", "xssl", "dw", "ghdw", "corpid_zw", "ghdw_zw" };
    DBInputFormat.setInput(job, XsTable.class, "LTJG_MEDI_SALE_FOR_HBASE", null, null, fields);
    // Mapper class
    job.setMapperClass(DBInputMapper.class);
    // Type of the map output key
    job.setMapOutputKeyClass(Text.class);
    // Type of the map output value
    job.setMapOutputValueClass(Text.class);
    // Reducer setup: write into the HBase table h_medi_sale
    TableMapReduceUtil.initTableReducerJob("h_medi_sale", HBaseToHBaseReducer.class, job);
    boolean success = job.waitForCompletion(true);
    System.exit(success ? 0 : 1);
}
- Get the HBase configuration, in preparation for submitting the job.
- Read the database connection settings and configure DBConfiguration, the prerequisite for MapReduce to talk to the relational database (Oracle here).
- Set the InputFormat, Mapper, and Reducer.
Notes:
1. Configuration loads the cluster settings. When debugging in Eclipse, put the hbase-site.xml configuration file under the resources directory, or set the cluster properties directly in code (see the sketch after these notes); when running on one of the Hadoop nodes, HBaseConfiguration.create() is enough on its own.
2. When DBInputFormat's setInput does not fit your scenario, you can subclass DBInputFormat and, using its source as a reference, override setInput to match your needs.
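For note 1, a minimal sketch of pointing the job at the cluster in code instead of shipping hbase-site.xml; the host names and port below are placeholders to replace with your own ZooKeeper quorum:
Configuration conf = HBaseConfiguration.create();
// Placeholder values: set these to your cluster's ZooKeeper quorum and client port
conf.set("hbase.zookeeper.quorum", "node1,node2,node3");
conf.set("hbase.zookeeper.property.clientPort", "2181");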
InputFormat
Because the input comes from a database, DBInputFormat is used here.
DBInputFormat's setInput method specifies the table and columns to read. I use the form shown below; there is also an overload that takes a SQL query (a sketch of it follows this paragraph), see the official documentation for details.
The job, tableName, conditions, orderBy, and fieldNames arguments (the job configuration, the table to read, the query condition, the sort column, and the columns to read) are all supplied at call time; in addition you must pass an inputClass, a class that implements DBWritable.
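For reference, a rough sketch of the query-based overload mentioned above; the two SQL statements are only illustrative (the count query lets the framework compute input splits):
String inputQuery = "SELECT id, ypbm, uploadtime, scph, corpid, xssl, dw, ghdw, corpid_zw, ghdw_zw FROM LTJG_MEDI_SALE_FOR_HBASE";
String inputCountQuery = "SELECT COUNT(1) FROM LTJG_MEDI_SALE_FOR_HBASE";
DBInputFormat.setInput(job, XsTable.class, inputQuery, inputCountQuery);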
- DBWritable
DBWritable is an interface with two methods, write and readFields. write copies the object's fields into a PreparedStatement; readFields populates the fields from the ResultSet read from the relational database.
public static class XsTable implements DBWritable {
    private String id;
    private String ypbm;
    private Date uploadtime;
    private String scph;
    private String corpid;
    private Double xssl;
    private String dw;
    private String ghdw;
    private String corpid_zw;
    private String ghdw_zw;

    @Override
    public void readFields(ResultSet result) throws SQLException {
        id = result.getString("id");
        ypbm = result.getString("ypbm");
        uploadtime = result.getDate("uploadtime");
        scph = result.getString("scph");
        corpid = result.getString("corpid");
        xssl = result.getDouble("xssl");
        dw = result.getString("dw");
        ghdw = result.getString("ghdw");
        corpid_zw = result.getString("corpid_zw");
        ghdw_zw = result.getString("ghdw_zw");
    }

    @Override
    public void write(PreparedStatement stmt) throws SQLException {
        stmt.setString(1, id);
        stmt.setString(2, ypbm);
        stmt.setDate(3, uploadtime);
        stmt.setString(4, scph);
        stmt.setString(5, corpid);
        stmt.setDouble(6, xssl);
        stmt.setString(7, dw);
        stmt.setString(8, ghdw);
        stmt.setString(9, corpid_zw);
        stmt.setString(10, ghdw_zw);
    }
}
Mapper
The map phase turns the input records into intermediate data; the map output is written to the local disk of the node running the map task (not to HDFS).
Here we write a Mapper subclass, DBInputMapper.
Looking at the source of the abstract Mapper class, it has four type parameters: KEYIN, VALUEIN, KEYOUT, VALUEOUT.
KEYIN: type of the input key
VALUEIN: type of the input value
KEYOUT: type of the output key
VALUEOUT: type of the output value
Override the map method to turn each record into the key/value pair that will later be written to HBase; see the official documentation for details.
Multi-threaded mapping can be achieved by overriding Mapper's run method so that several map calls execute in parallel (see the MultithreadedMapper sketch after the mapper code below).
public static class DBInputMapper extends Mapper<LongWritable, XsTable, Text, Text> {
    @Override
    protected void map(LongWritable key, XsTable value, Mapper<LongWritable, XsTable, Text, Text>.Context context)
            throws IOException, InterruptedException {
        String yuefen = null;
        Integer tian = null;
        String time = null;
        String ypbm = value.ypbm;
        Date uploadtime = value.uploadtime;
        String scph = value.scph;
        String corpid = value.corpid;
        Double sl = value.xssl;
        String dw = value.dw;
        String ghdw = value.ghdw;
        String corpid_zw = value.corpid_zw;
        String ghdw_zw = value.ghdw_zw;
        if (StringUtils.isBlank(ypbm)) {
            ypbm = "isnull";
        }
        if (StringUtils.isBlank(scph)) {
            scph = "isnull";
        }
        if (StringUtils.isBlank(corpid)) {
            corpid = "isnull";
        }
        if (sl == null) {
            sl = 0d;
        }
        if (StringUtils.isBlank(dw)) {
            dw = "isnull";
        }
        if (StringUtils.isBlank(ghdw)) {
            ghdw = "isnull";
        }
        if (StringUtils.isBlank(corpid_zw)) {
            corpid_zw = StringUtils.isBlank(corpid) ? "isnull" : corpid;
        }
        if (StringUtils.isBlank(ghdw_zw)) {
            ghdw_zw = StringUtils.isBlank(ghdw) ? "isnull" : ghdw;
        }
        if (uploadtime != null) {
            DateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd");
            time = dateFormat.format(uploadtime).replace("-", "");
            yuefen = time.substring(0, 6);
            tian = Integer.parseInt(time.substring(6, time.length()));
            StringBuffer newRowkey = new StringBuffer();
            // Map output key: ypbm~uploadtime[yyyyMM]~scph~uploadtime[dd]
            newRowkey.append(ypbm).append(SPLIT).append(yuefen).append(SPLIT).append(scph).append(SPLIT)
                    .append(tian);
            // Map output value: the remaining columns serialized as JSON
            Col col = new Col(value.id, ypbm, scph, corpid, Double.toString(sl), dw, ghdw, corpid_zw, ghdw_zw);
            Text k = new Text(newRowkey.toString());
            Text v = new Text(JSONObject.toJSONString(col));
            // Don't forget to emit the pair through the Mapper context.
            context.write(k, v);
        }
    }
}
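On the multi-threading point above: instead of overriding run yourself, Hadoop also ships MultithreadedMapper (org.apache.hadoop.mapreduce.lib.map), which runs several copies of a delegate mapper inside one map task. A sketch of wiring it up in main, assuming four threads is a sensible number for this job:
// Replaces job.setMapperClass(DBInputMapper.class) in main
job.setMapperClass(MultithreadedMapper.class);
MultithreadedMapper.setMapperClass(job, DBInputMapper.class);
MultithreadedMapper.setNumberOfThreads(job, 4);
This mainly helps when each map call does non-trivial work; measure before enabling it.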
TableReducer(Reducer)
The HBase abstract class TableReducer extends Hadoop's Reducer, so you can read the official Reducer documentation directly.
Reducer has the same four type parameters as Mapper:
KEYIN: type of the input key
VALUEIN: type of the input value
KEYOUT: type of the output key
VALUEOUT: type of the output value
public abstract class TableReducer<KEYIN, VALUEIN, KEYOUT>
        extends Reducer<KEYIN, VALUEIN, KEYOUT, Mutation> {
}
TableReducer extends Reducer and fixes the output value type to Mutation (in practice a Put or Delete). HBase stores nothing but bytes, with no other data types, so every value is converted with Bytes.toBytes before it is written.
We override Reducer's reduce method, which reaches our class through TableReducer.
public static class HBaseToHBaseReducer extends TableReducer<Text, Text, ImmutableBytesWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> value,
            Reducer<Text, Text, ImmutableBytesWritable, Mutation>.Context context) throws IOException,
            InterruptedException {
        String strKey = key.toString();
        // ypbm~uploadtime[yyyyMM]~scph becomes the rowkey
        String rowkey = StringUtils.substringBeforeLast(strKey, SPLIT);
        // uploadtime[dd] becomes the column qualifier
        String colName = StringUtils.substringAfterLast(strKey, SPLIT);
        byte[] bytesRowkey = Bytes.toBytes(rowkey);
        List<Col> list = new ArrayList<Col>();
        Put put = new Put(bytesRowkey);
        Iterator<Text> iterator = value.iterator();
        while (iterator.hasNext()) {
            Col col = JSONObject.parseObject(iterator.next().toString(), Col.class);
            list.add(col);
        }
        put.add(Bytes.toBytes(COLSNAME), Bytes.toBytes(colName), Bytes.toBytes(JSONObject.toJSONString(list)));
        // Don't forget to write the Put through the TableReducer context.
        context.write(new ImmutableBytesWritable(bytesRowkey), put);
    }
}
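One assumption worth calling out: the target table h_medi_sale and the column family referred to by COLSNAME must already exist before the job runs. A rough sketch of creating them with the 0.96-era client API, using "cols" as a stand-in family name since the real COLSNAME value is not shown in this excerpt:
Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);
// "cols" is a placeholder; use the same family name as COLSNAME
HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("h_medi_sale"));
desc.addFamily(new HColumnDescriptor("cols"));
if (!admin.tableExists("h_medi_sale")) {
    admin.createTable(desc);
}
admin.close();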
Running the MapReduce Job
Package the finished program into a jar and upload it to one of the Hadoop node servers, placing it on HADOOP_CLASSPATH.
Then run:
hadoop jar XXXXX.jar
MapReduce Overview
MapReduce proceeds through five stages: Input files, Map phase, Intermediate files, Reduce phase, and Output files.
In this example:
the relational database Oracle corresponds to the Input files;
the Mapper corresponds to the Map phase;
the Mapper's context.write(key, value) output corresponds to the Intermediate files;
the TableReducer (Reducer) corresponds to the Reduce phase;
the Reducer's context.write(key, value) corresponds to the Output files, written into the HBase table.
Maven pom Dependencies
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.3.0</version>
</dependency>
<dependency>
<groupId>commons-configuration</groupId>
<artifactId>commons-configuration</artifactId>
<version>1.6</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-auth</artifactId>
<version>2.3.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.3.0</version>
</dependency>
<dependency>
<groupId>commons-io</groupId>
<artifactId>commons-io</artifactId>
<version>1.3.2</version>
</dependency>
<dependency>
<groupId>commons-logging</groupId>
<artifactId>commons-logging</artifactId>
<version>1.1.3</version>
</dependency>
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.17</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>0.96.1-hadoop2</version>
</dependency>
<dependency>
<groupId>com.google.protobuf</groupId>
<artifactId>protobuf-java</artifactId>
<version>2.5.0</version>
</dependency>
<dependency>
<groupId>io.netty</groupId>
<artifactId>netty</artifactId>
<version>3.6.6.Final</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-common</artifactId>
<version>0.96.1-hadoop2</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-protocol</artifactId>
<version>0.96.1-hadoop2</version>
</dependency>
<dependency>
<groupId>org.apache.zookeeper</groupId>
<artifactId>zookeeper</artifactId>
<version>3.4.5</version>
</dependency>
<dependency>
<groupId>org.cloudera.htrace</groupId>
<artifactId>htrace-core</artifactId>
<version>2.01</version>
</dependency>
<dependency>
<groupId>org.codehaus.jackson</groupId>
<artifactId>jackson-mapper-asl</artifactId>
<version>1.9.13</version>
</dependency>
<dependency>
<groupId>org.codehaus.jackson</groupId>
<artifactId>jackson-core-asl</artifactId>
<version>1.9.13</version>
</dependency>
<dependency>
<groupId>org.codehaus.jackson</groupId>
<artifactId>jackson-jaxrs</artifactId>
<version>1.9.13</version>
</dependency>
<dependency>
<groupId>org.codehaus.jackson</groupId>
<artifactId>jackson-xc</artifactId>
<version>1.9.13</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.6.4</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.6.4</version>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.2.46</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hbase/hbase -->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase</artifactId>
<version>0.90.2</version>
</dependency>
</dependencies>
Complete Source Code
GitHub: https://github.com/huangliuyu00/bigdata/tree/master/OracleToHbase