MapReduce Source Code: DBInputFormat

DBInputFormat is an InputFormat implementation that reads data from relational database tables. It is also the mechanism Sqoop uses to import data from relational databases into HDFS.

/**
 * An InputFormat that reads input data from a SQL table.
 * <p>
 * DBInputFormat emits LongWritables containing the record number as
 * key and DBWritables as value.
 *
 * The SQL query and input class can be set using one of the two
 * setInput methods.
 */

The DBInputFormat.getSplits method first runs a COUNT query against the source table (optionally constrained by a WHERE clause) to obtain the record count, then divides that count by the configured number of map tasks to derive the per-split chunk size. Which rows each split actually fetches is discussed below, starting with DBInputFormat.createDBRecordReader.

public List<InputSplit> getSplits(JobContext job) throws IOException {

  ResultSet results = null;  
  Statement statement = null;
  try {
    statement = connection.createStatement();

    results = statement.executeQuery(getCountQuery()); // total record count (the count query may include a WHERE clause)
    results.next();

    long count = results.getLong(1);
    int chunks = job.getConfiguration().getInt(MRJobConfig.NUM_MAPS, 1); // number of map tasks configured for the job
    long chunkSize = (count / chunks); // rows per split (integer division; the last split absorbs the remainder)

    results.close();
    statement.close();

    List<InputSplit> splits = new ArrayList<InputSplit>();

    // Split the rows into n-number of chunks and adjust the last chunk
    // accordingly
    for (int i = 0; i < chunks; i++) {
      DBInputSplit split;

      if ((i + 1) == chunks)
        // The last split extends to the full count so the remainder is not
        // lost. E.g. 50 rows and 3 maps give chunkSize = 16, producing the
        // splits [0,16), [16,32), [32,50).
        split = new DBInputSplit(i * chunkSize, count);
      else
        split = new DBInputSplit(i * chunkSize, (i * chunkSize)
            + chunkSize);

      splits.add(split);
    }

    connection.commit();
    return splits;
  } catch (SQLException e) {
    throw new IOException("Got SQLException", e);
  } finally {
    try {
      if (results != null) { results.close(); }
    } catch (SQLException e1) {}
    try {
      if (statement != null) { statement.close(); }
    } catch (SQLException e1) {}

    closeConnection();
  }
}
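The split arithmetic above can be reproduced in isolation. Below is a minimal sketch (plain Java, no Hadoop dependencies; the class and method names are my own, not Hadoop's) that computes the (start, end) row ranges for a hypothetical table of 50 rows read by 3 map tasks:

```java
import java.util.ArrayList;
import java.util.List;

class SplitMath {
    // Mirrors getSplits: integer-divide the row count by the map count,
    // and let the last split absorb the remainder.
    static List<long[]> computeSplits(long count, int chunks) {
        long chunkSize = count / chunks;
        List<long[]> splits = new ArrayList<>();
        for (int i = 0; i < chunks; i++) {
            long start = i * chunkSize;
            long end = (i + 1 == chunks) ? count : start + chunkSize;
            splits.add(new long[] { start, end });
        }
        return splits;
    }

    public static void main(String[] args) {
        // 50 rows over 3 maps: chunkSize = 16, so the ranges are
        // [0,16), [16,32), [32,50) -- the last split takes the extra rows.
        for (long[] s : computeSplits(50, 3)) {
            System.out.println(s[0] + " - " + s[1]);
        }
    }
}
```

Note that the splits are half-open row-number ranges, not primary-key ranges; how they are translated into SQL is shown below in getSelectQuery.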

 

The DBInputFormat.createDBRecordReader method chooses a vendor-specific reader based on the database product name. OracleDBRecordReader and MySQLDBRecordReader are subclasses of DBRecordReader that adjust the generated SQL.
protected RecordReader<LongWritable, T> createDBRecordReader(DBInputSplit split,
    Configuration conf) throws IOException {

  @SuppressWarnings("unchecked")
  Class<T> inputClass = (Class<T>) (dbConf.getInputClass()); // the value class, set via mapreduce.jdbc.input.class; must implement DBWritable
  try {
    // use database product name to determine appropriate record reader.
    if (dbProductName.startsWith("ORACLE")) {
      // use Oracle-specific db reader.
      return new OracleDBRecordReader<T>(split, inputClass,
          conf, createConnection(), getDBConf(), conditions, fieldNames,
          tableName);
    } else if (dbProductName.startsWith("MYSQL")) {
      // use MySQL-specific db reader.
      return new MySQLDBRecordReader<T>(split, inputClass,
          conf, createConnection(), getDBConf(), conditions, fieldNames,
          tableName);
    } else {
      // Generic reader.
      return new DBRecordReader<T>(split, inputClass,
          conf, createConnection(), getDBConf(), conditions, fieldNames,
          tableName);
    }
  } catch (SQLException ex) {
    throw new IOException(ex.getMessage());
  }
}

 

The DBRecordReader.nextKeyValue method
public boolean nextKeyValue() throws IOException {
  try {
    if (key == null) {
      key = new LongWritable();
    }
    if (value == null) {
      value = createValue();
    }
    if (null == this.results) {
      // First time into this method, run the query.
      this.results = executeQuery(getSelectQuery()); // always a java.sql.ResultSet, but the query text differs per database; subclasses customize it by overriding getSelectQuery
    }
    if (!results.next())
      return false;

    // Set the key field value as the output key value
    key.set(pos + split.getStart());

    value.readFields(results);

    pos ++;
  } catch (SQLException e) {
    throw new IOException("SQLException in nextKeyValue", e);
  }
  return true;
}

The DBRecordReader.getSelectQuery method builds the SELECT statement; the subclass OracleDBRecordReader overrides it to page with Oracle's ROWNUM instead of LIMIT/OFFSET, as shown below. This answers the earlier question of how each split fetches exactly its own rows, with no duplicates and no gaps: the records are ordered by some column, and each split reads only the range between its start and end positions. The prerequisite is that the data is not modified while the job runs; if rows are inserted, deleted, or reordered during the import, the splits no longer line up.

/** Returns the query for selecting the records from an Oracle DB. */
protected String getSelectQuery() {
  StringBuilder query = new StringBuilder();
  DBConfiguration dbConf = getDBConf();
  String conditions = getConditions();
  String tableName = getTableName();
  String [] fieldNames = getFieldNames();

  // Oracle-specific codepath to use rownum instead of LIMIT/OFFSET.
  if(dbConf.getInputQuery() == null) {
    query.append("SELECT ");

    for (int i = 0; i < fieldNames.length; i++) {
      query.append(fieldNames[i]);
      if (i != fieldNames.length -1) {
        query.append(", ");
      }
    }

    query.append(" FROM ").append(tableName);
    if (conditions != null && conditions.length() > 0)
      query.append(" WHERE ").append(conditions);
    String orderBy = dbConf.getInputOrderBy(); // ordering column, from mapreduce.jdbc.input.orderby
    if (orderBy != null && orderBy.length() > 0) {
      query.append(" ORDER BY ").append(orderBy); // emit the rows in a deterministic order
    }
  } else {
    //PREBUILT QUERY
    query.append(dbConf.getInputQuery());
  }
      
  try {
    DBInputFormat.DBInputSplit split = getSplit();
    if (split.getLength() > 0){
      String querystring = query.toString();

      query = new StringBuilder();
      query.append("SELECT * FROM (SELECT a.*,ROWNUM dbif_rno FROM ( ");
      query.append(querystring);
      query.append(" ) a WHERE rownum <= ").append(split.getEnd()); // upper bound of the row range, from the split's end
      query.append(" ) WHERE dbif_rno > ").append(split.getStart()); // lower bound of the row range, from the split's start
    }
  } catch (IOException ex) {
    // ignore, will not throw.
  }             

  return query.toString();
}
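The double-nested ROWNUM pattern above can be isolated into a small helper. A sketch in plain Java (the class and method names are my own, for illustration) that wraps an arbitrary inner query the same way OracleDBRecordReader does:

```java
class RowRangeQuery {
    // Wrap an inner query so that only rows (start, end] of its ordered
    // output are returned. ROWNUM is assigned before ORDER BY takes effect,
    // which is why two levels of nesting are required: the inner subquery
    // fixes the ordering, the middle one materializes ROWNUM as dbif_rno.
    static String wrapWithRowRange(String inner, long start, long end) {
        return "SELECT * FROM (SELECT a.*,ROWNUM dbif_rno FROM ( "
             + inner
             + " ) a WHERE rownum <= " + end
             + " ) WHERE dbif_rno > " + start;
    }

    public static void main(String[] args) {
        // The split [16,32) from the earlier 50-row example:
        System.out.println(wrapWithRowRange(
            "SELECT id, name FROM users ORDER BY id", 16, 32));
    }
}
```

The `rownum <= end` predicate can be evaluated during the scan, while `dbif_rno > start` is applied to the materialized numbering, so only the outer filter discards rows.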

 

To actually receive the rows, you must implement the DBWritable interface yourself. NullDBWritable is the bundled base implementation, and all four of its methods are empty; at minimum you need to override readFields(ResultSet) to extract anything from the result set.

public static class NullDBWritable implements DBWritable, Writable {
    @Override
    public void readFields(DataInput in) throws IOException { }
    @Override
    public void readFields(ResultSet arg0) throws SQLException { }
    @Override
    public void write(DataOutput out) throws IOException { }
    @Override
    public void write(PreparedStatement arg0) throws SQLException { }
  }
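As an illustration, here is a hypothetical record class for a table with columns id (BIGINT) and name (VARCHAR). The class name UserRecord and the column names are invented for this example; in a real job the class would implement org.apache.hadoop.mapreduce.lib.db.DBWritable (and Writable), but only the two java.sql-facing methods are sketched here:

```java
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical value class for a table user(id BIGINT, name VARCHAR).
// In a real job this would implement DBWritable and Writable, and be
// registered via mapreduce.jdbc.input.class / DBInputFormat.setInput().
class UserRecord {
    long id;
    String name;

    // Called once per row; the ResultSet cursor is already positioned.
    public void readFields(ResultSet rs) throws SQLException {
        id = rs.getLong("id");
        name = rs.getString("name");
    }

    // Used on the output side (DBOutputFormat) to bind INSERT parameters.
    public void write(PreparedStatement ps) throws SQLException {
        ps.setLong(1, id);
        ps.setString(2, name);
    }
}
```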

 

Parameters defined in the DBConfiguration class; most of them must be specified on the job. They can be set with the DBConfiguration.configureDB() and DBInputFormat.setInput() helper methods.
/** The JDBC Driver class name */
public static final String DRIVER_CLASS_PROPERTY = "mapreduce.jdbc.driver.class"; // JDBC driver class

/** JDBC Database access URL */
public static final String URL_PROPERTY = "mapreduce.jdbc.url"; // database connection URL

/** User name to access the database */
public static final String USERNAME_PROPERTY = "mapreduce.jdbc.username"; // user name

/** Password to access the database */
public static final String PASSWORD_PROPERTY = "mapreduce.jdbc.password"; // password

/** Input table name */
public static final String INPUT_TABLE_NAME_PROPERTY = "mapreduce.jdbc.input.table.name"; // input table name

/** Field names in the Input table */
public static final String INPUT_FIELD_NAMES_PROPERTY = "mapreduce.jdbc.input.field.names"; // fields to import

/** WHERE clause in the input SELECT statement */
public static final String INPUT_CONDITIONS_PROPERTY = "mapreduce.jdbc.input.conditions"; // WHERE conditions

/** ORDER BY clause in the input SELECT statement */
public static final String INPUT_ORDER_BY_PROPERTY = "mapreduce.jdbc.input.orderby"; // ordering column

/** Whole input query, excluding LIMIT...OFFSET */
public static final String INPUT_QUERY = "mapreduce.jdbc.input.query"; // if set, replaces the SQL assembled from the four properties above; it should not itself bound the split column, e.g. select * from table where conditions order by column

/** Input query to get the count of records */
public static final String INPUT_COUNT_QUERY = "mapreduce.jdbc.input.count.query"; // if set, used instead of the default COUNT query, saving one count pass over the table

/** Input query to get the max and min values of the jdbc.input.query */
public static final String INPUT_BOUNDING_QUERY = "mapred.jdbc.input.bounding.query"; // supplies the bounds of the split column

/** Class name implementing DBWritable which will hold input tuples */
public static final String INPUT_CLASS_PROPERTY = "mapreduce.jdbc.input.class"; // the value type of the input key/value pairs; an implementation of DBWritable

/** Output table name */
public static final String OUTPUT_TABLE_NAME_PROPERTY = "mapreduce.jdbc.output.table.name"; // output table name

/** Field names in the Output table */
public static final String OUTPUT_FIELD_NAMES_PROPERTY = "mapreduce.jdbc.output.field.names"; // output table fields

/** Number of fields in the Output table */
public static final String OUTPUT_FIELD_COUNT_PROPERTY = "mapreduce.jdbc.output.field.count"; // number of fields in the output table
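Putting the properties together, a job would normally set them through the two helper methods rather than by property name. A hedged sketch of the driver code (the table name, column names, and UserRecord value class are invented for illustration; this fragment additionally needs a running MySQL instance and the Hadoop job classpath):

```java
Job job = Job.getInstance(new Configuration(), "db-import");
job.setInputFormatClass(DBInputFormat.class);

// Sets mapreduce.jdbc.driver.class, .url, .username, .password
DBConfiguration.configureDB(job.getConfiguration(),
    "com.mysql.jdbc.Driver",
    "jdbc:mysql://dbhost:3306/mydb",
    "user", "passwd");

// Sets mapreduce.jdbc.input.class, .table.name, .conditions,
// .orderby and .field.names in one call
DBInputFormat.setInput(job, UserRecord.class,
    "user",           // input table
    null,             // WHERE conditions
    "id",             // ORDER BY column
    "id", "name");    // fields to read
```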

 

 

 

A consideration: importing from a relational database with mapnum map tasks issues mapnum ORDER BY queries, so for large imports the database load becomes very high. Both the database itself and the network between the database and the Hadoop cluster can become the bottleneck. Compared with this, first exporting data files with the database's own export tool and then loading them into HDFS is generally more efficient: the export tool is faster, the database load is lower, and no MapReduce slots are consumed on the cluster, so the overall throughput may well be higher.

Possible improvements: 1. partition the source table in the database (range or hash partitioning); 2. write a DBInputFormat subclass that sets the number of maps to the number of partitions and lets each map read one partition directly, eliminating the sort entirely. This can improve throughput considerably and also reduces the load on the database.
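The second improvement (a DBInputFormat subclass reading one partition per map) can be sketched as a per-map query builder. The helper below is my own illustration, not Hadoop code; it assumes Oracle's `PARTITION (name)` selection syntax and a `p0, p1, ...` partition naming scheme, and in a real implementation it would live in a DBRecordReader subclass:

```java
class PartitionQuery {
    // One query per map task: map i reads Oracle partition p<i> directly.
    // No ORDER BY and no ROWNUM window, so the database performs no sort
    // and no row-number bookkeeping for any map.
    static String partitionQuery(String table, String[] fields, int mapIndex) {
        return "SELECT " + String.join(", ", fields)
             + " FROM " + table
             + " PARTITION (p" + mapIndex + ")";
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) {
            System.out.println(
                partitionQuery("user", new String[] { "id", "name" }, i));
        }
    }
}
```

Because partitions are disjoint by construction, the no-duplicates/no-gaps guarantee comes from the database's partitioning scheme instead of a sorted row-number window, and it remains valid even while rows are inserted into other partitions.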

 

 

Reposted from: https://my.oschina.net/yulongblog/blog/1506237
