Reading DBF Files with Multiple Threads


Reading a Large File with Multiple Threads in Java

Requirement

The task: parse DBF files and store the contents in HBase or HDFS. My first plan was to read them with Kettle and write the output to HBase. Small files went through quickly enough, but when I hit a 1 GB file (a test file; production files are far larger than 1 GB), Kettle's output to HBase was painfully slow. Perhaps my HBase skills are limited, because no amount of tuning helped much. So I decided to write a program myself to solve the problem, and it ended up roughly as slow as Kettle, which was a bit embarrassing (awkward laugh). Either way, here is the implementation.

The DBF File Format

DBF is a special file format consisting mainly of a header and a file body. The header carries the descriptive information for the whole file, chiefly:

  • the length of the header
  • the number of records (similar to rows in MySQL)
  • the length of each record
  • the fields each record contains
  • the length of each field
  • the type of each field
  • and so on

The file body consists of the records themselves, one after another:

Field 1 | Field 2 | Field 3 | Field 4 | Field 5 | Field 6
Field 1 value (10 bytes) | Field 2 value (10 bytes) | Field 3 value (20 bytes) | Field 4 value (10 bytes) | Field 5 value (10 bytes) | Field 6 value (15 bytes)
Field 1 value (10 bytes) | Field 2 value (10 bytes) | Field 3 value (20 bytes) | Field 4 value (10 bytes) | Field 5 value (10 bytes) | Field 6 value (15 bytes)
Field 1 value (10 bytes) | Field 2 value (10 bytes) | Field 3 value (20 bytes) | Field 4 value (10 bytes) | Field 5 value (10 bytes) | Field 6 value (15 bytes)

That is roughly what the layout looks like.

Because it is so regular, parsing is fairly straightforward.
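
For illustration, here is a minimal standalone sketch (not part of the project code) of parsing the fixed part of a dBASE III style header. The offsets used are the commonly documented ones; verify them against your own DBF variant:

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

// A minimal sketch of parsing the fixed part of a dBASE III style header.
public class DbfHeaderSketch {
    public static void main(String[] args) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream(args[0]))) {
            byte[] head = new byte[12];
            in.readFully(head);
            // bytes 4-7: number of records, little-endian unsigned 32-bit
            long numberOfRecords = (head[4] & 0xFFL)
                    | (head[5] & 0xFFL) << 8
                    | (head[6] & 0xFFL) << 16
                    | (head[7] & 0xFFL) << 24;
            // bytes 8-9: header length in bytes, little-endian
            int headerLength = (head[8] & 0xFF) | (head[9] & 0xFF) << 8;
            // bytes 10-11: length of one record, including the one-byte deletion flag
            int recordLength = (head[10] & 0xFF) | (head[11] & 0xFF) << 8;
            System.out.printf("records=%d headerLength=%d recordLength=%d%n",
                    numberOfRecords, headerLength, recordLength);
        }
    }
}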

Reading a DBF File with Multiple Threads

When a DBF file is large, reading it with a single ordinary thread takes a long time, so reading the file in parallel with multiple threads is an attractive option.

The implementation goes roughly like this:

  1. Read the DBF header to obtain the header length, the record count, the field names, the byte size of each field, and the other values listed above.
  2. Work out how many records each thread should read (see the sketch after this list).
  3. Use the FileChannel obtained from a RandomAccessFile (java.nio.channels) to carry out each thread's reads.
  4. Define an output interface; each time a record is parsed, hand the parsed data out through a callback so it is easy to process further (store it to HDFS, write it as a TSV file, or store it in HBase).
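
Before the real code, here is a standalone sketch of the work-split arithmetic from steps 2 and 3; all the concrete values are illustrative:

// A standalone sketch of how the record range is split across threads.
public class SplitSketch {
    public static void main(String[] args) {
        int numberOfRecords = 1_000_003; // from the header (example value)
        int recordLength = 86;           // from the header (example value)
        int headerLength = 1_024;        // from the header (example value)
        int poolSize = 4;                // configured thread count

        int linesPerThread = numberOfRecords / poolSize;
        for (int i = 0; i < poolSize; i++) {
            // the last thread picks up the remainder
            int lines = (i == poolSize - 1)
                    ? numberOfRecords - linesPerThread * (poolSize - 1)
                    : linesPerThread;
            // the +1 skips the first record's one-byte flag, matching the
            // offset convention used by the real code below
            long startPos = headerLength + 1 + (long) i * linesPerThread * recordLength;
            long byteSize = (long) lines * recordLength;
            System.out.printf("thread %d: start=%d bytes=%d lines=%d%n",
                    i, startPos, byteSize, lines);
        }
    }
}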
Reading the header
    /**
     * Read the header information.
     *
     * @throws IOException
     */
    private void readHeader() throws IOException {
        DataInputStream dataInput = new DataInputStream(IOUtils.genFileInputStream(new File(file)));
        dbfHeader.read(dataInput, null, false);
        Charset charset = this.dbfHeader.getDetectedCharset();
        if (charset != null) {
            this.userCharset = charset;
        } else {
            this.userCharset = Charset.forName("GBK");
        }
        this.fields = dbfHeader.getFieldArray();
        // header length
        this.headerLength = dbfHeader.getHeaderLength();
        // length of each record
        this.recordLength = dbfHeader.getRecordLength();
        // number of records
        this.numberOfRecords = dbfHeader.numberOfRecords;
        dataInput.close();
    }

The task that reads the DBF file
private final class ReadTask implements Runnable {
        // the file to read
        private String file;
        // the position in the file where this thread starts reading
        private long startPos;
        // the number of bytes this thread has to read
        private long readByteSize;
        // the length of one record
        private int recordLength;

        private DBFFieldUtils[] fields;

        private Charset charset;

        private ReadListener listener;

        private Output output;

        private int num = 0;

        ReadTask(String file, long startPos, long readByteSize, int recordLength, DBFFieldUtils[] fields, Charset charset, ReadListener listener) {
            this.file = file;
            this.startPos = startPos;
            this.readByteSize = readByteSize;
            this.recordLength = recordLength;
            this.fields = fields;
            this.charset = charset;
            this.listener = listener;
        }

        public Output getOutput() {
            return output;
        }

        public void setOutput(Output output) {
            this.output = output;
        }

        @Override
        public void run() {
            this.read();
        }

        private void read() {
            RandomAccessFile randomAccessFile = null;
            FileChannel channel = null;
            try {
                randomAccessFile = new RandomAccessFile(file, "r");
                channel = randomAccessFile.getChannel();
                channel.position(startPos);
                // the position where this thread stops reading
                long endPos = startPos + readByteSize;
                boolean isEnd = false;
                int buffLine = (int) (readByteSize / recordLength);
                buffLine = buffLine < Config.getReadLineSize() ? buffLine : Config.getReadLineSize();
                ByteBuffer buffer = ByteBuffer.allocate(this.recordLength * buffLine);
                while ((channel.read(buffer) != -1 && startPos < endPos) && !isEnd) {
                    byte[] bytes = new byte[buffer.position()];
                    buffer.flip();
                    buffer.get(bytes);
                    // the number of records read in this round
                    int readedLines = bytes.length / this.recordLength;
                    // check whether this is the file's last record (the last
                    // record has no trailing separator byte, so it comes up one byte short)
                    if (readedLines * this.recordLength < bytes.length) {
                        isEnd = true;
                        readedLines += 1;
                    }
                    int start = 0;
                    for (int i = 0; i < readedLines; i++) {
                        // extract the data of one record
                        Field[] yssfields = new Field[fields.length];
                        for (int j = 0; j < fields.length; j++) {
                            byte[] fieldBytes = new byte[fields[j].getLength()];
                            System.arraycopy(bytes, start, fieldBytes, 0, fieldBytes.length);
                            String fieldContent = new String(fieldBytes, 0, fieldBytes.length, this.charset).trim();
                            start += fieldBytes.length;
                            Field yssField = new Field();
                            yssField.setFieldName(fields[j].getName());
                            yssField.setContent(fieldContent);
                            yssfields[j] = yssField;
                        }
                        start = start + 1; // skip the one-byte separator after the record
                        num++; // count every parsed record, even when there is no output
                        if (this.output != null) {
                            output.write(yssfields, false);
                        }
                    }
                    this.startPos += bytes.length;
                    channel.position(startPos);
                    // the number of records left after this round, capped at
                    // the configured batch size, decides the next buffer size
                    buffLine = (int) ((readByteSize - (long) recordLength * num) / recordLength);
                    buffLine = buffLine < Config.getReadLineSize() ? buffLine : Config.getReadLineSize();
                    buffer = ByteBuffer.allocate(this.recordLength * buffLine);
                }
            } catch (IOException e) {
                e.printStackTrace();

            } finally {
                IOUtils.close(randomAccessFile, channel);
                if (this.output != null) {
                    this.output.write(null, true);
                    this.output.onDestroy();
                }
                if (this.listener != null) {
                    this.listener.finish(true, num);
                    System.out.println(Thread.currentThread() + " read " + num + " lines");
                }
            }
        }
    }
Computing each thread's workload from the preconfigured thread-pool size
    /**
     * read start
     */
    public void start() {
        this.startTime = System.currentTimeMillis();
        try {
            readHeader();
            if (this.numberOfRecords <= Config.getReadLineSize()) {
                poolSize = 1;
            }
            // the number of records each thread should read
            int readLines = this.numberOfRecords / poolSize;
            // start reading right after the header
            long start = headerLength + 1;

            // the number of records handed out so far
            int line = 0;
            for (int i = 0; i < poolSize; i++) {
                // bytes per thread = records to read * record length
                long readByteSize = (long) readLines * this.recordLength;
                if (i < poolSize - 1) {
                    line += readLines;
                    ReadTask task = new ReadTask(file, start, readByteSize, this.recordLength, this.fields, userCharset, this);
                    Output output = genOutput();
                    if (output != null) {
                        task.setOutput(output);
                        output.onCreate();
                    }
                    this.threadPool.submit(task);
                    start = start + readByteSize;
                    System.out.println("线程" + i + "需要读" + readLines + "行");
                } else {
                    // the last thread takes all remaining records
                    int lastLines = numberOfRecords - line;
                    readByteSize = (long) lastLines * this.recordLength;
                    ReadTask task = new ReadTask(file, start, readByteSize, this.recordLength, this.fields, userCharset, this);
                    Output output = genOutput();
                    if (output != null) {
                        task.setOutput(output);
                        output.onCreate();
                    }
                    this.threadPool.submit(task);
                    System.out.println("线程" + i + "需要读" + lastLines + "行");
                }

            }

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
Defining the output interface
@SuppressWarnings("all")
public abstract class Output {

    /**
     * Emits each record to the caller after it has been parsed.
     * @param fields the fields of one record (field name, field value)
     * @param isFinish whether processing has finished
     */
    public abstract void write(Field[] fields, boolean isFinish);

    public void onCreate() {
    }

    public void onDestroy() {
    }
}
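
For example, a hypothetical Output implementation (not part of the project code; the file naming is illustrative) that writes every record as a tab-separated line could look like this:

package com.yss.dbf;

import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.atomic.AtomicInteger;

// A hypothetical Output implementation that writes records as TSV lines.
// Each ReadTask gets its own Output instance, so no synchronization is needed.
@SuppressWarnings("all")
public class TsvOutput extends Output {

    private static final AtomicInteger SEQ = new AtomicInteger();

    private PrintWriter writer;

    @Override
    public void onCreate() {
        try {
            // one file per Output instance; the name is illustrative
            writer = new PrintWriter(Files.newBufferedWriter(
                    Paths.get("records-" + SEQ.getAndIncrement() + ".tsv")));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    @Override
    public void write(Field[] fields, boolean isFinish) {
        if (isFinish) {
            writer.flush();
            return;
        }
        StringBuilder sb = new StringBuilder();
        for (Field field : fields) {
            if (sb.length() > 0) sb.append('\t');
            sb.append(field.getContent());
        }
        writer.println(sb);
    }

    @Override
    public void onDestroy() {
        if (writer != null) writer.close();
    }
}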
The complete implementation
package com.yss.dbf;

import java.io.DataInputStream;
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

@SuppressWarnings("all")
public class ReadDBF implements ReadListener {

    private DBFHeaderUtils dbfHeader;
    // charset
    private Charset userCharset;

    // header length
    private int headerLength;

    // number of records
    private int numberOfRecords;

    // length of each record
    private int recordLength;

    private int poolSize = Config.getThreadPoolSize();

    private ExecutorService threadPool;


    private String file;

    private DBFFieldUtils[] fields;

    // number of threads that have finished
    private int finishCount = 0;

    private long startTime = 0;

    private int totalLines = 0;

    private ReadStatusCallBack readStatusCallBack;
    // number of threads that finished successfully
    private int successThreadCount = 0;

    public ReadDBF(String file) {
        this.dbfHeader = new DBFHeaderUtils();
        this.file = file;
        this.threadPool = Executors.newFixedThreadPool(this.poolSize);
    }

    private Output genOutput() {
        Output output = null;
        if (Config.getOutputClassName() == null) return null;
        try {
            Class<?> clz = Class.forName(Config.getOutputClassName());
            Object obj = clz.newInstance();
            if (obj instanceof Output) {
                output = ((Output) obj);
            } else {
                System.err.println("输出类需要实现com.yss.dbf.Output");
            }
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        } catch (IllegalAccessException e) {
            e.printStackTrace();
        } catch (InstantiationException e) {
            e.printStackTrace();
        }
        return output;
    }

    public void start(ReadStatusCallBack readStatusCallBack) {
        this.readStatusCallBack = readStatusCallBack;
        this.start();
    }

    /**
     * read start
     */
    public void start() {
        this.startTime = System.currentTimeMillis();
        try {
            readHeader();
            if (this.numberOfRecords <= Config.getReadLineSize()) {
                poolSize = 1;
            }
            // the number of records each thread should read
            int readLines = this.numberOfRecords / poolSize;
            // start reading right after the header
            long start = headerLength + 1;

            // the number of records handed out so far
            int line = 0;
            for (int i = 0; i < poolSize; i++) {
                // bytes per thread = records to read * record length
                long readByteSize = (long) readLines * this.recordLength;
                if (i < poolSize - 1) {
                    line += readLines;
                    ReadTask task = new ReadTask(file, start, readByteSize, this.recordLength, this.fields, userCharset, this);
                    Output output = genOutput();
                    if (output != null) {
                        task.setOutput(output);
                        output.onCreate();
                    }
                    this.threadPool.submit(task);
                    start = start + readByteSize;
                    System.out.println("线程" + i + "需要读" + readLines + "行");
                } else {
                    // the last thread takes all remaining records
                    int lastLines = numberOfRecords - line;
                    readByteSize = (long) lastLines * this.recordLength;
                    ReadTask task = new ReadTask(file, start, readByteSize, this.recordLength, this.fields, userCharset, this);
                    Output output = genOutput();
                    if (output != null) {
                        task.setOutput(output);
                        output.onCreate();
                    }
                    this.threadPool.submit(task);
                    System.out.println("线程" + i + "需要读" + lastLines + "行");
                }

            }

        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Read the header information.
     *
     * @throws IOException
     */
    private void readHeader() throws IOException {
        DataInputStream dataInput = new DataInputStream(IOUtils.genFileInputStream(new File(file)));
        dbfHeader.read(dataInput, null, false);
        Charset charset = this.dbfHeader.getDetectedCharset();
        if (charset != null) {
            this.userCharset = charset;
        } else {
            this.userCharset = Charset.forName("GBK");
        }
        this.fields = dbfHeader.getFieldArray();
        // header length
        this.headerLength = dbfHeader.getHeaderLength();
        // length of each record
        this.recordLength = dbfHeader.getRecordLength();
        // number of records
        this.numberOfRecords = dbfHeader.numberOfRecords;
        dataInput.close();
    }

    @Override
    public void finish(boolean success, int readLines) {
        this.finishCount++;
        totalLines += readLines;
        this.successThreadCount = success ? (++this.successThreadCount) : this.successThreadCount;
        if (this.finishCount == poolSize) {
            this.threadPool.shutdown();
            System.out.println("共花费:" + ((System.currentTimeMillis() - startTime) * 1.0 / 1000) + "秒,总计:" + totalLines + "行");
            if (readStatusCallBack != null) {
                readStatusCallBack.finish(this.successThreadCount == this.poolSize);
            }
        }

    }


    private final class ReadTask implements Runnable {
        // the file to read
        private String file;
        // the position in the file where this thread starts reading
        private long startPos;
        // the number of bytes this thread has to read
        private long readByteSize;
        // the length of one record
        private int recordLength;

        private DBFFieldUtils[] fields;

        private Charset charset;

        private ReadListener listener;

        private Output output;

        private int num = 0;

        ReadTask(String file, long startPos, long readByteSize, int recordLength, DBFFieldUtils[] fields, Charset charset, ReadListener listener) {
            this.file = file;
            this.startPos = startPos;
            this.readByteSize = readByteSize;
            this.recordLength = recordLength;
            this.fields = fields;
            this.charset = charset;
            this.listener = listener;
        }

        public Output getOutput() {
            return output;
        }

        public void setOutput(Output output) {
            this.output = output;
        }

        @Override
        public void run() {
            this.read();
        }

        private void read() {
            RandomAccessFile randomAccessFile = null;
            FileChannel channel = null;
            try {
                randomAccessFile = new RandomAccessFile(file, "r");
                channel = randomAccessFile.getChannel();
                channel.position(startPos);
                // the position where this thread stops reading
                long endPos = startPos + readByteSize;
                boolean isEnd = false;
                int buffLine = (int) (readByteSize / recordLength);
                buffLine = buffLine < Config.getReadLineSize() ? buffLine : Config.getReadLineSize();
                ByteBuffer buffer = ByteBuffer.allocate(this.recordLength * buffLine);
                while ((channel.read(buffer) != -1 && startPos < endPos) && !isEnd) {
                    byte[] bytes = new byte[buffer.position()];
                    buffer.flip();
                    buffer.get(bytes);
                    // the number of records read in this round
                    int readedLines = bytes.length / this.recordLength;
                    // check whether this is the file's last record (the last
                    // record has no trailing separator byte, so it comes up one byte short)
                    if (readedLines * this.recordLength < bytes.length) {
                        isEnd = true;
                        readedLines += 1;
                    }
                    int start = 0;
                    for (int i = 0; i < readedLines; i++) {
                        // extract the data of one record
                        Field[] yssfields = new Field[fields.length];
                        for (int j = 0; j < fields.length; j++) {
                            byte[] fieldBytes = new byte[fields[j].getLength()];
                            System.arraycopy(bytes, start, fieldBytes, 0, fieldBytes.length);
                            String fieldContent = new String(fieldBytes, 0, fieldBytes.length, this.charset).trim();
                            start += fieldBytes.length;
                            Field yssField = new Field();
                            yssField.setFieldName(fields[j].getName());
                            yssField.setContent(fieldContent);
                            yssfields[j] = yssField;
                        }
                        start = start + 1; // skip the one-byte separator after the record
                        num++; // count every parsed record, even when there is no output
                        if (this.output != null) {
                            output.write(yssfields, false);
                        }
                    }
                    this.startPos += bytes.length;
                    channel.position(startPos);
                    // the number of records left after this round, capped at
                    // the configured batch size, decides the next buffer size
                    buffLine = (int) ((readByteSize - (long) recordLength * num) / recordLength);
                    buffLine = buffLine < Config.getReadLineSize() ? buffLine : Config.getReadLineSize();
                    buffer = ByteBuffer.allocate(this.recordLength * buffLine);
                }
            } catch (IOException e) {
                e.printStackTrace();

            } finally {
                IOUtils.close(randomAccessFile, channel);
                if (this.output != null) {
                    this.output.write(null, true);
                    this.output.onDestroy();
                }
                if (this.listener != null) {
                    this.listener.finish(true, num);
                    System.out.println(Thread.currentThread() + " read " + num + " lines");
                }
            }
        }
    }

}

I use a configuration file to set the thread-pool size, the number of records read per batch, the class implementing the output interface, and so on.
The configuration file looks like this:

# delimiter between output fields
output.fields.delimited=#
# how many records to read per batch
read.line.size=100
# thread pool size
thread.pool.size=7
# output class
output.class.name=com.yss.dbf.out.HBaseOutput
# batch size for writes to HBase
hbase.output.batch=50000
# path of the text file produced from the DBF file, for MR to turn into HFiles
dbf.textfile.output.path=
# HDFS path that the dbf.textfile.output.path file is uploaded to
dbf.textfile.hdfs.tmp.path=
# HDFS path of the generated HFiles
hfile.hdfs.path=
#==========================================
# HBase configuration
hbase.zookeeper.property.clientPort=2181
# comma-separated ZooKeeper quorum, e.g. vhost1,vhost2,vhost3, where each vhost is a host running ZK
hbase.zookeeper.quorum=vhost1,vhost2,vhost3
# root znode of HBase in ZK
zookeeper.znode.parent=/hbase-unsecure
hbase.table.name=ktr
hbase.table.family=cf
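
The Config class referenced throughout is not shown in this post. Here is a minimal sketch of how it might load these properties; the method names match the calls made above, everything else is an assumption:

import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

// A hypothetical sketch of the Config class; only the property keys come
// from the file above, the loading logic is an assumption.
@SuppressWarnings("all")
public class ConfigSketch {

    private static final Properties PROPS = new Properties();

    static {
        try (InputStream in = ConfigSketch.class
                .getResourceAsStream("/config.properties")) { // path is illustrative
            PROPS.load(in);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static int getThreadPoolSize() {
        return Integer.parseInt(PROPS.getProperty("thread.pool.size", "1"));
    }

    public static int getReadLineSize() {
        return Integer.parseInt(PROPS.getProperty("read.line.size", "100"));
    }

    public static String getOutputClassName() {
        return PROPS.getProperty("output.class.name");
    }

    public static String getOutPutDelimited() {
        return PROPS.getProperty("output.fields.delimited", "#");
    }

    public static int getHbaseOutputBatch() {
        return Integer.parseInt(PROPS.getProperty("hbase.output.batch", "50000"));
    }
}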
The output class that writes to HBase:
package com.yss.dbf.out;

import com.yss.dbf.Config;
import com.yss.dbf.Field;
import com.yss.dbf.IOUtils;
import com.yss.dbf.Output;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;

@SuppressWarnings("all")
public class HBaseOutput1 extends Output {

    private Connection conn;
    private final int BATCHSIZE = Config.getHbaseOutputBatch();
    private BufferedMutator mutator;
    private List<Put> list;

    @Override
    public void onCreate() {
        try {
            conn = ConnectionFactory.createConnection(Config.getConf());
            BufferedMutatorParams params = new BufferedMutatorParams(TableName.valueOf(Config.getHbaseTableName()));
            params.writeBufferSize(1024 * 1024 * 10);
            params.pool(Executors.newFixedThreadPool(2));
            mutator = conn.getBufferedMutator(params);
            list = new ArrayList<>();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    @Override
    public void onDestroy() {
        IOUtils.close(conn, mutator);
    }

    @Override
    public void write(Field[] fields, boolean isFinish) {
        if (!isFinish) {
            if (fields.length > 4) {
                String rowKey = genRowKey(fields[3].getContent(),
                        fields[0].getContent(),
                        fields[1].getContent(),
                        fields[2].getContent());
                Put put = new Put(Bytes.toBytes(rowKey));
                for (int i = 4; i < fields.length; i++) {
                    put.addColumn(Bytes.toBytes(Config.getHbaseTableFamily()),
                            Bytes.toBytes(fields[i].getFieldName()),
                            Bytes.toBytes(fields[i].getContent()));
                }
                this.list.add(put);

                if (this.list.size() == BATCHSIZE) {
                    commit();
                }
            }
        } else {
            commit();
            System.out.println(Thread.currentThread().getName() + " done!");
        }
    }

    private void commit() {
        try {
            mutator.mutate(list);
            mutator.flush();
            this.list.clear();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private String genRowKey(String... fields) {
        StringBuilder sb = new StringBuilder();
        for (String field : fields) {
            sb.append(field)
                    .append(Config.getOutPutDelimited());
        }
        sb.delete(sb.length() - 1, sb.length());
        return sb.toString();
    }

}

In the HBase output class I only had one particular file in mind, so I used a simple hard-coded rowKey design. You could of course define an interface for rowKey design instead, for example:

package com.yss.dbf.out;

import com.yss.dbf.Field;

public interface HBaseRowKey {
    String genRowKey(Field[] fields);
}
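
A hypothetical implementation of this interface, mirroring the hard-coded rowKey layout used in the HBase output class above, might look like:

package com.yss.dbf.out;

import com.yss.dbf.Config;
import com.yss.dbf.Field;

// A hypothetical HBaseRowKey implementation that reproduces the fixed
// field order used above: fields[3], fields[0], fields[1], fields[2].
public class SimpleRowKey implements HBaseRowKey {

    @Override
    public String genRowKey(Field[] fields) {
        return fields[3].getContent() + Config.getOutPutDelimited()
                + fields[0].getContent() + Config.getOutPutDelimited()
                + fields[1].getContent() + Config.getOutPutDelimited()
                + fields[2].getContent();
    }
}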

That is roughly the whole pipeline for parsing a DBF file and storing it in HBase. Since writing straight into HBase was still too slow, the next idea was to generate HFiles with MapReduce and then bulk-load them into HBase with the load tool, which in theory should be much faster.

The Map side of the MR job:
package com.yss.mr;

import com.yss.dbf.Config;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

@SuppressWarnings("all")
public class WriteToHFile extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

    private String[] columns;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        columns = context.getConfiguration().get("columns.name").split(Config.getOutPutDelimited());
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] fields = value.toString().split(Config.getOutPutDelimited());
        String ROWKEY = genRowKey(fields[3], fields[0], fields[1], fields[2]);
        Put put = new Put(Bytes.toBytes(ROWKEY));
        for (int i = 4; i < columns.length; i++) {
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes(columns[i]), Bytes.toBytes(fields[i]));
        }
        ImmutableBytesWritable rowkey = new ImmutableBytesWritable(Bytes.toBytes(ROWKEY));
        context.write(rowkey, put);
    }

    private String genRowKey(String... fields) {
        StringBuilder sb = new StringBuilder();
        for (String field : fields) {
            sb.append(field)
                    .append(Config.getOutPutDelimited());
        }
        sb.delete(sb.length() - 1, sb.length());
        return sb.toString();
    }
}

The Driver side of the MR job:
package com.yss.mr;

import com.yss.dbf.Config;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

@SuppressWarnings("all")
public class Driver implements Tool {
    private final static String INPUT_PATH = "file:///Users/lijiayan/Desktop/tempDir/";
    private final static String OUTPUT_PATH = "hdfs://host:port/tmp/ljy/out/";

    private Configuration conf;

    @Override
    public int run(String[] strings) throws Exception {
        Connection connection = ConnectionFactory.createConnection(conf);
        Table table = connection.getTable(TableName.valueOf("ktr"));
        Job job = Job.getInstance(conf);
        job.setJarByClass(Driver.class);
        job.setMapperClass(WriteToHFile.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);
        job.setOutputFormatClass(HFileOutputFormat2.class);
        HFileOutputFormat2.configureIncrementalLoad(job, table, connection.getRegionLocator(TableName.valueOf("ktr")));
        FileInputFormat.addInputPath(job, new Path(INPUT_PATH));
        FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    @Override
    public void setConf(Configuration configuration) {
        configuration.set("colums.name", genColums(
                "GDDM",
                "GDXM",
                "BCRQ",
                "CJBH",
                "GSDM",
                "CJSL",
                "BCYE",
                "ZQDM",
                "SBSJ",
                "CJSJ",
                "CJJG",
                "CJJE",
                "SQBH",
                "BS",
                "MJBH"
        ));
        configuration.set("hbase.zookeeper.property.clientPort", "2181");
        configuration.set("hbase.zookeeper.quorum", "ip");
        configuration.set("zookeeper.znode.parent", "/hbase-unsecure");
        configuration.set("hbase.fs.tmp.dir","hdfs://ip:port/tmp/ljy/temp/");
        this.conf = configuration;
    }

    @Override
    public Configuration getConf() {
        return conf;
    }

    private String genColumns(String... fields) {
        StringBuilder sb = new StringBuilder();
        for (String field : fields) {
            sb.append(field)
                    .append(Config.getOutPutDelimited());
        }
        sb.delete(sb.length() - 1, sb.length());
        return sb.toString();
    }

    public static void main(String[] args) {
        try {
            Driver driver = new Driver();
            if (ToolRunner.run(driver, args) == 0) { // after the job succeeds, load the HFiles into HBase as follows
                Connection connection = ConnectionFactory.createConnection(driver.conf);
                Admin admin = connection.getAdmin();
                Table table = connection.getTable(TableName.valueOf("ktr"));
                LoadIncrementalHFiles load = new LoadIncrementalHFiles(driver.conf);
                load.doBulkLoad(new Path(OUTPUT_PATH), admin, table, connection.getRegionLocator(TableName.valueOf("ktr")));
            } else {
                System.out.println("运行错误");
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

}

End

And that is about it; the main code is commented throughout. Since the results were not ideal, this work is still in the testing stage and has not gone to production. This post records, first, an approach to reading DBF files with multiple threads, which extends naturally to multi-threaded reading of any large file, since the processing logic is much the same; second, a way of writing data into HBase; and third, the technique of generating HFiles with MR and then loading the HFiles into HBase.
If you build on this and anything is unclear, feel free to reach out.
The complete code has been uploaded.
I will push it to GitHub later; if you need it, you can also contact me at lijiayan_mail@163.com
