DBInputFormat is the InputFormat implementation for reading rows from a relational database table. It is also the mechanism Sqoop builds on to import data from relational databases into HDFS.
/**
 * An InputFormat that reads input data from an SQL table.
 * <p>
 * DBInputFormat emits LongWritables containing the record number as
 * key and DBWritables as value.
 *
 * The SQL query, and input class can be set using one of the two
 * setInput methods.
 */
The main job of DBInputFormat.getSplits is to fetch the record count of the queried table (the count query may carry a WHERE condition), then divide that count by the configured number of maps to size the splits. How each split knows which rows it should fetch is covered further below, starting from DBInputFormat.createDBRecordReader.
public List<InputSplit> getSplits(JobContext job) throws IOException {
  ResultSet results = null;
  Statement statement = null;
  try {
    statement = connection.createStatement();
    // Count the total number of records in the input.
    results = statement.executeQuery(getCountQuery());
    results.next();
    long count = results.getLong(1);
    // Number of maps configured for the job.
    int chunks = job.getConfiguration().getInt(MRJobConfig.NUM_MAPS, 1);
    // Rows per split.
    long chunkSize = (count / chunks);
    results.close();
    statement.close();
    List<InputSplit> splits = new ArrayList<InputSplit>();

    // Split the rows into n-number of chunks and adjust the last chunk
    // accordingly
    for (int i = 0; i < chunks; i++) {
      DBInputSplit split;
      // Each split covers a contiguous row range; the last split absorbs the
      // remainder. Example: 50 rows and 3 maps give chunkSize = 16, producing
      // splits [0,16), [16,32) and [32,50).
      if ((i + 1) == chunks)
        split = new DBInputSplit(i * chunkSize, count);
      else
        split = new DBInputSplit(i * chunkSize, (i * chunkSize) + chunkSize);
      splits.add(split);
    }

    connection.commit();
    return splits;
  } catch (SQLException e) {
    throw new IOException("Got SQLException", e);
  } finally {
    try {
      if (results != null) { results.close(); }
    } catch (SQLException e1) {}
    try {
      if (statement != null) { statement.close(); }
    } catch (SQLException e1) {}
    closeConnection();
  }
}
The DBInputFormat.createDBRecordReader method:
protected RecordReader<LongWritable, T> createDBRecordReader(
    DBInputSplit split, Configuration conf) throws IOException {

  @SuppressWarnings("unchecked")
  // The DBWritable value class configured via mapreduce.jdbc.input.class.
  Class<T> inputClass = (Class<T>) (dbConf.getInputClass());
  try {
    // Use database product name (from the JDBC connection metadata) to
    // determine the appropriate record reader; OracleDBRecordReader and
    // MySQLDBRecordReader are subclasses of DBRecordReader.
    if (dbProductName.startsWith("ORACLE")) {
      // use Oracle-specific db reader.
      return new OracleDBRecordReader<T>(split, inputClass,
          conf, createConnection(), getDBConf(), conditions,
          fieldNames, tableName);
    } else if (dbProductName.startsWith("MYSQL")) {
      // use MySQL-specific db reader.
      return new MySQLDBRecordReader<T>(split, inputClass,
          conf, createConnection(), getDBConf(), conditions,
          fieldNames, tableName);
    } else {
      // Generic reader.
      return new DBRecordReader<T>(split, inputClass,
          conf, createConnection(), getDBConf(), conditions,
          fieldNames, tableName);
    }
  } catch (SQLException ex) {
    throw new IOException(ex.getMessage());
  }
}
The DBRecordReader.nextKeyValue method:
public boolean nextKeyValue() throws IOException {
  try {
    if (key == null) {
      key = new LongWritable();
    }
    if (value == null) {
      value = createValue();
    }
    if (null == this.results) {
      // First time into this method, run the query. The result is always a
      // java.sql.ResultSet, but the SQL issued can differ per database;
      // subclasses override getSelectQuery() to customize it.
      this.results = executeQuery(getSelectQuery());
    }
    if (!results.next())
      return false;

    // Set the key field value as the output key value
    key.set(pos + split.getStart());
    value.readFields(results);
    pos++;
  } catch (SQLException e) {
    throw new IOException("SQLException in nextKeyValue", e);
  }
  return true;
}
DBRecordReader.getSelectQuery builds the query in MySQL style (LIMIT/OFFSET); the subclass OracleDBRecordReader overrides it with Oracle-flavored SQL, shown below. This answers the earlier question: how does each split fetch exactly its own rows, with no duplicates and no gaps? The records are sorted by some column, and each reader then selects only the rows between its split's start and end offsets. This only holds if the data is not modified while the job runs; concurrent changes will shift the row numbering and corrupt the split boundaries.
/** Returns the query for selecting the records from an Oracle DB. */
protected String getSelectQuery() {
  StringBuilder query = new StringBuilder();
  DBConfiguration dbConf = getDBConf();
  String conditions = getConditions();
  String tableName = getTableName();
  String [] fieldNames = getFieldNames();

  // Oracle-specific codepath to use rownum instead of LIMIT/OFFSET.
  if (dbConf.getInputQuery() == null) {
    query.append("SELECT ");
    for (int i = 0; i < fieldNames.length; i++) {
      query.append(fieldNames[i]);
      if (i != fieldNames.length - 1) {
        query.append(", ");
      }
    }
    query.append(" FROM ").append(tableName);
    if (conditions != null && conditions.length() > 0)
      query.append(" WHERE ").append(conditions);
    // Sort column taken from mapreduce.jdbc.input.orderby.
    String orderBy = dbConf.getInputOrderBy();
    if (orderBy != null && orderBy.length() > 0) {
      // Emit the records in sorted order.
      query.append(" ORDER BY ").append(orderBy);
    }
  } else {
    // PREBUILT QUERY
    query.append(dbConf.getInputQuery());
  }

  try {
    DBInputFormat.DBInputSplit split = getSplit();
    if (split.getLength() > 0) {
      String querystring = query.toString();
      query = new StringBuilder();
      query.append("SELECT * FROM (SELECT a.*,ROWNUM dbif_rno FROM ( ");
      query.append(querystring);
      // Upper bound of the row range comes from the split's end offset...
      query.append(" ) a WHERE rownum <= ").append(split.getEnd());
      // ...and the lower bound from the split's start offset.
      query.append(" ) WHERE dbif_rno > ").append(split.getStart());
    }
  } catch (IOException ex) {
    // ignore, will not throw.
  }
  return query.toString();
}
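To make the row range concrete: for a hypothetical split covering rows (16, 32] over a table t with fields id, name ordered by id, the Oracle reader above would issue roughly SELECT * FROM (SELECT a.*,ROWNUM dbif_rno FROM ( SELECT id, name FROM t ORDER BY id ) a WHERE rownum <= 32 ) WHERE dbif_rno > 16, while the base MySQL-style DBRecordReader instead appends LIMIT 16 OFFSET 16 to the inner query (split length, then split start).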
To actually get data out, you must also implement the DBWritable interface yourself. NullDBWritable is the bundled no-op implementation whose four methods do nothing; at minimum you need to override readFields(ResultSet) to pull the values you need out of the result set. A sketch of a concrete implementation follows the class below.
public static class NullDBWritable implements DBWritable, Writable {
  @Override
  public void readFields(DataInput in) throws IOException { }
  @Override
  public void readFields(ResultSet arg0) throws SQLException { }
  @Override
  public void write(DataOutput out) throws IOException { }
  @Override
  public void write(PreparedStatement arg0) throws SQLException { }
}
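A minimal sketch of a concrete DBWritable, assuming a hypothetical table with columns (id BIGINT, name VARCHAR); the class and column names here are illustrative only:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

// Hypothetical value class for a table with columns (id BIGINT, name VARCHAR).
public class UserRecord implements Writable, DBWritable {
  private long id;
  private String name;

  @Override
  public void readFields(ResultSet rs) throws SQLException {
    // Map one row of the result set onto this object; column order must
    // match mapreduce.jdbc.input.field.names.
    this.id = rs.getLong(1);
    this.name = rs.getString(2);
  }

  @Override
  public void write(PreparedStatement ps) throws SQLException {
    // Only needed when writing back via DBOutputFormat.
    ps.setLong(1, id);
    ps.setString(2, name);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    // Hadoop serialization, used when records move between tasks.
    id = in.readLong();
    name = in.readUTF();
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeLong(id);
    out.writeUTF(name);
  }
}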
The parameters defined in DBConfiguration; most of them must be specified on the job. They can be set with DBConfiguration.configureDB() and DBInputFormat.setInput() when building the job, as in the driver sketch after the constant list below.
/** The JDBC Driver class name */
public static final String DRIVER_CLASS_PROPERTY = "mapreduce.jdbc.driver.class";

/** JDBC Database access URL */
public static final String URL_PROPERTY = "mapreduce.jdbc.url";

/** User name to access the database */
public static final String USERNAME_PROPERTY = "mapreduce.jdbc.username";

/** Password to access the database */
public static final String PASSWORD_PROPERTY = "mapreduce.jdbc.password";

/** Input table name */
public static final String INPUT_TABLE_NAME_PROPERTY = "mapreduce.jdbc.input.table.name";

/** Field names in the Input table */
public static final String INPUT_FIELD_NAMES_PROPERTY = "mapreduce.jdbc.input.field.names";

/** WHERE clause in the input SELECT statement */
public static final String INPUT_CONDITIONS_PROPERTY = "mapreduce.jdbc.input.conditions";

/** ORDER BY clause in the input SELECT statement */
public static final String INPUT_ORDER_BY_PROPERTY = "mapreduce.jdbc.input.orderby";

/** Whole input query, excluding LIMIT...OFFSET. If set, it replaces the SQL
 *  assembled from the four properties above (table, fields, conditions,
 *  order by), e.g. "SELECT * FROM table WHERE conditions ORDER BY column";
 *  it carries no row-range predicate of its own. */
public static final String INPUT_QUERY = "mapreduce.jdbc.input.query";

/** Input query to get the count of records. If set, it saves one extra
 *  COUNT round trip against the database. */
public static final String INPUT_COUNT_QUERY = "mapreduce.jdbc.input.count.query";

/** Input query to get the max and min values of the jdbc.input.query;
 *  this is where the column range comes in. */
public static final String INPUT_BOUNDING_QUERY = "mapred.jdbc.input.bounding.query";

/** Class name implementing DBWritable which will hold input tuples;
 *  this is the value type of the input key/value pairs. */
public static final String INPUT_CLASS_PROPERTY = "mapreduce.jdbc.input.class";

/** Output table name */
public static final String OUTPUT_TABLE_NAME_PROPERTY = "mapreduce.jdbc.output.table.name";

/** Field names in the Output table */
public static final String OUTPUT_FIELD_NAMES_PROPERTY = "mapreduce.jdbc.output.field.names";

/** Number of fields in the Output table */
public static final String OUTPUT_FIELD_COUNT_PROPERTY = "mapreduce.jdbc.output.field.count";
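A minimal driver sketch showing how these knobs are typically set; the driver class, URL, credentials, table and field names are placeholders, and UserRecord is the hypothetical DBWritable from above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;

public class DBImportDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Sets mapreduce.jdbc.driver.class, .url, .username and .password.
    DBConfiguration.configureDB(conf,
        "com.mysql.jdbc.Driver",            // placeholder driver
        "jdbc:mysql://dbhost:3306/mydb",    // placeholder URL
        "user", "password");                // placeholder credentials

    Job job = Job.getInstance(conf, "db-import");
    job.setJarByClass(DBImportDriver.class);

    // Table form of setInput: value class, table name, WHERE conditions,
    // ORDER BY column, then the field names to select. This also sets
    // DBInputFormat as the job's input format.
    DBInputFormat.setInput(job, UserRecord.class,
        "users", null, "id", "id", "name");

    // ... set mapper, output format and output path here ...
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}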
A caveat: with the number of maps set to mapnum, an import from the relational database triggers mapnum separate ORDER BY sorts on the database side. For large imports this drives database load very high, and both that load and the network between the database and the Hadoop cluster can become the bottleneck. Compared with first dumping the data to a file with the database's own export tool and then loading the file into HDFS, the native export tools are generally more efficient, put less load on the database, and require no MapReduce resources on the cluster, so the overall import may well be faster.
Two possible improvements: 1. partition the table in the database, by range or by hash; 2. subclass DBInputFormat so that the number of maps equals the number of partitions and the sort is dropped entirely, with each map reading exactly one partition. This can buy a substantial speedup and noticeably lighten the load on the database; a hedged sketch of such a subclass follows.
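A sketch of idea 2, under the assumption that the table has named partitions and that a custom property (db.input.partitions, invented here) lists them. A matching DBRecordReader subclass, not shown, would build a per-partition query such as MySQL's SELECT ... FROM t PARTITION (p0), with no ORDER BY and no row-range predicate:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

public class PartitionedDBInputFormat<T extends DBWritable>
    extends DBInputFormat<T> {

  /** Hypothetical property naming the table partitions, e.g. "p0,p1,p2". */
  public static final String PARTITIONS_PROPERTY = "db.input.partitions";

  /** A split that identifies one database partition. */
  public static class PartitionSplit extends InputSplit implements Writable {
    private String partition;

    public PartitionSplit() { }    // needed for deserialization
    public PartitionSplit(String partition) { this.partition = partition; }

    public String getPartition() { return partition; }

    @Override public long getLength() { return 0; }   // row count unknown
    @Override public String[] getLocations() { return new String[0]; }

    @Override public void write(DataOutput out) throws IOException {
      Text.writeString(out, partition);
    }
    @Override public void readFields(DataInput in) throws IOException {
      partition = Text.readString(in);
    }
  }

  // One split per partition: no COUNT query, no sort, and each map scans
  // exactly one partition.
  @Override
  public List<InputSplit> getSplits(JobContext job) throws IOException {
    String[] partitions =
        job.getConfiguration().getStrings(PARTITIONS_PROPERTY);
    if (partitions == null) {
      throw new IOException(PARTITIONS_PROPERTY + " is not set");
    }
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (String p : partitions) {
      splits.add(new PartitionSplit(p));
    }
    return splits;
  }
}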