DBInputFormat is the InputFormat implementation for reading rows from a relational database table. It is also the mechanism Sqoop builds on to import data from relational databases into HDFS.
/**
 * An InputFormat that reads input data from an SQL table.
 * <p>
 * DBInputFormat emits LongWritables containing the record number as
 * key and DBWritables as value.
 *
 * The SQL query, and input class can be set using one of the two
 * setInput methods.
 */
The main job of DBInputFormat.getSplits is to fetch the record count of the queried table (the count query may carry a WHERE condition), then divide that count by the configured number of maps to size the splits. How each split knows which rows it should fetch is covered further below, starting from DBInputFormat.createDBRecordReader.
public List<InputSplit> getSplits(JobContext job) throws IOException {
  ResultSet results = null;
  Statement statement = null;
  try {
    statement = connection.createStatement();
    // Count the total number of records in the input.
    results = statement.executeQuery(getCountQuery());
    results.next();
    long count = results.getLong(1);
    // Number of maps configured for the job.
    int chunks = job.getConfiguration().getInt(MRJobConfig.NUM_MAPS, 1);
    // Rows per split.
    long chunkSize = (count / chunks);
    results.close();
    statement.close();
    List<InputSplit> splits = new ArrayList<InputSplit>();

    // Split the rows into n-number of chunks and adjust the last chunk
    // accordingly
    for (int i = 0; i < chunks; i++) {
      DBInputSplit split;
      // Each split covers a contiguous row range; the last split absorbs the
      // remainder. Example: 50 rows and 3 maps give chunkSize = 16, producing
      // splits [0,16), [16,32) and [32,50).
      if ((i + 1) == chunks)
        split = new DBInputSplit(i * chunkSize, count);
      else
        split = new DBInputSplit(i * chunkSize, (i * chunkSize) + chunkSize);
      splits.add(split);
    }

    connection.commit();
    return splits;
  } catch (SQLException e) {
    throw new IOException("Got SQLException", e);
  } finally {
    try {
      if (results != null) { results.close(); }
    } catch (SQLException e1) {}
    try {
      if (statement != null) { statement.close(); }
    } catch (SQLException e1) {}
    closeConnection();
  }
}
The DBInputFormat.createDBRecordReader method:
protected RecordReader<LongWritable, T> createDBRecordReader(
    DBInputSplit split, Configuration conf) throws IOException {

  @SuppressWarnings("unchecked")
  // The DBWritable value class configured via mapreduce.jdbc.input.class.
  Class<T> inputClass = (Class<T>) (dbConf.getInputClass());
  try {
    // Use database product name (from the JDBC connection metadata) to
    // determine the appropriate record reader; OracleDBRecordReader and
    // MySQLDBRecordReader are subclasses of DBRecordReader.
    if (dbProductName.startsWith("ORACLE")) {
      // use Oracle-specific db reader.
      return new OracleDBRecordReader<T>(split, inputClass,
          conf, createConnection(), getDBConf(), conditions,
          fieldNames, tableName);
    } else if (dbProductName.startsWith("MYSQL")) {
      // use MySQL-specific db reader.
      return new MySQLDBRecordReader<T>(split, inputClass,
          conf, createConnection(), getDBConf(), conditions,
          fieldNames, tableName);
    } else {
      // Generic reader.
      return new DBRecordReader<T>(split, inputClass,
          conf, createConnection(), getDBConf(), conditions,
          fieldNames, tableName);
    }
  } catch (SQLException ex) {
    throw new IOException(ex.getMessage());
  }
}
The DBRecordReader.nextKeyValue method:
public boolean nextKeyValue() throws IOException {
  try {
    if (key == null) {
      key = new LongWritable();
    }
    if (value == null) {
      value = createValue();
    }
    if (null == this.results) {
      // First time into this method, run the query. The result is always a
      // java.sql.ResultSet, but the SQL issued can differ per database;
      // subclasses override getSelectQuery() to customize it.
      this.results = executeQuery(getSelectQuery());
    }
    if (!results.next())
      return false;

    // Set the key field value as the output key value
    key.set(pos + split.getStart());
    value.readFields(results);
    pos++;
  } catch (SQLException e) {
    throw new IOException("SQLException in nextKeyValue", e);
  }
  return true;
}
DBRecordReader.getSelectQuery builds the query in MySQL style (LIMIT/OFFSET); the subclass OracleDBRecordReader overrides it with Oracle-flavored SQL, shown below. This answers the earlier question: how does each split fetch exactly its own rows, with no duplicates and no gaps? The records are sorted by some column, and each reader then selects only the rows between its split's start and end offsets. This only holds if the data is not modified while the job runs; concurrent changes will shift the row numbering and corrupt the split boundaries.
/** Returns the query for selecting the records from an Oracle DB. */
protected String getSelectQuery() {
  StringBuilder query = new StringBuilder();
  DBConfiguration dbConf = getDBConf();
  String conditions = getConditions();
  String tableName = getTableName();
  String [] fieldNames = getFieldNames();

  // Oracle-specific codepath to use rownum instead of LIMIT/OFFSET.
  if (dbConf.getInputQuery() == null) {
    query.append("SELECT ");
    for (int i = 0; i < fieldNames.length; i++) {
      query.append(fieldNames[i]);
      if (i != fieldNames.length - 1) {
        query.append(", ");
      }
    }
    query.append(" FROM ").append(tableName);
    if (conditions != null && conditions.length() > 0)
      query.append(" WHERE ").append(conditions);
    // Sort column taken from mapreduce.jdbc.input.orderby.
    String orderBy = dbConf.getInputOrderBy();
    if (orderBy != null && orderBy.length() > 0) {
      // Emit the records in sorted order.
      query.append(" ORDER BY ").append(orderBy);
    }
  } else {
    // PREBUILT QUERY
    query.append(dbConf.getInputQuery());
  }

  try {
    DBInputFormat.DBInputSplit split = getSplit();
    if (split.getLength() > 0) {
      String querystring = query.toString();
      query = new StringBuilder();
      query.append("SELECT * FROM (SELECT a.*,ROWNUM dbif_rno FROM ( ");
      query.append(querystring);
      // Upper bound of the row range comes from the split's end offset...
      query.append(" ) a WHERE rownum <= ").append(split.getEnd());
      // ...and the lower bound from the split's start offset.
      query.append(" ) WHERE dbif_rno > ").append(split.getStart());
    }
  } catch (IOException ex) {
    // ignore, will not throw.
  }
  return query.toString();
}
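To make the row range concrete: for a hypothetical split covering rows (16, 32] over a table t with fields id, name ordered by id, the Oracle reader above would issue roughly SELECT * FROM (SELECT a.*,ROWNUM dbif_rno FROM ( SELECT id, name FROM t ORDER BY id ) a WHERE rownum <= 32 ) WHERE dbif_rno > 16, while the base MySQL-style DBRecordReader instead appends LIMIT 16 OFFSET 16 to the inner query (split length, then split start).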
To actually get data out, you must also implement the DBWritable interface yourself. NullDBWritable is the bundled no-op implementation whose four methods do nothing; at minimum you need to override readFields(ResultSet) to pull the values you need out of the result set. A sketch of a concrete implementation follows the class below.
public static class NullDBWritable implements DBWritable, Writable {
  @Override
  public void readFields(DataInput in) throws IOException { }
  @Override
  public void readFields(ResultSet arg0) throws SQLException { }
  @Override
  public void write(DataOutput out) throws IOException { }
  @Override
  public void write(PreparedStatement arg0) throws SQLException { }
}
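A minimal sketch of a concrete DBWritable, assuming a hypothetical table with columns (id BIGINT, name VARCHAR); the class and column names here are illustrative only:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

// Hypothetical value class for a table with columns (id BIGINT, name VARCHAR).
public class UserRecord implements Writable, DBWritable {
  private long id;
  private String name;

  @Override
  public void readFields(ResultSet rs) throws SQLException {
    // Map one row of the result set onto this object; column order must
    // match mapreduce.jdbc.input.field.names.
    this.id = rs.getLong(1);
    this.name = rs.getString(2);
  }

  @Override
  public void write(PreparedStatement ps) throws SQLException {
    // Only needed when writing back via DBOutputFormat.
    ps.setLong(1, id);
    ps.setString(2, name);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    // Hadoop serialization, used when records move between tasks.
    id = in.readLong();
    name = in.readUTF();
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeLong(id);
    out.writeUTF(name);
  }
}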
The parameters defined in DBConfiguration; most of them must be specified on the job. They can be set with DBConfiguration.configureDB() and DBInputFormat.setInput() when building the job, as in the driver sketch after the constant list below.
/** The JDBC Driver class name */
public static final String DRIVER_CLASS_PROPERTY = "mapreduce.jdbc.driver.class";

/** JDBC Database access URL */
public static final String URL_PROPERTY = "mapreduce.jdbc.url";

/** User name to access the database */
public static final String USERNAME_PROPERTY = "mapreduce.jdbc.username";

/** Password to access the database */
public static final String PASSWORD_PROPERTY = "mapreduce.jdbc.password";

/** Input table name */
public static final String INPUT_TABLE_NAME_PROPERTY = "mapreduce.jdbc.input.table.name";

/** Field names in the Input table */
public static final String INPUT_FIELD_NAMES_PROPERTY = "mapreduce.jdbc.input.field.names";

/** WHERE clause in the input SELECT statement */
public static final String INPUT_CONDITIONS_PROPERTY = "mapreduce.jdbc.input.conditions";

/** ORDER BY clause in the input SELECT statement */
public static final String INPUT_ORDER_BY_PROPERTY = "mapreduce.jdbc.input.orderby";

/** Whole input query, excluding LIMIT...OFFSET. If set, it replaces the SQL
 *  assembled from the four properties above (table, fields, conditions,
 *  order by), e.g. "SELECT * FROM table WHERE conditions ORDER BY column";
 *  it carries no row-range predicate of its own. */
public static final String INPUT_QUERY = "mapreduce.jdbc.input.query";

/** Input query to get the count of records. If set, it saves one extra
 *  COUNT round trip against the database. */
public static final String INPUT_COUNT_QUERY = "mapreduce.jdbc.input.count.query";

/** Input query to get the max and min values of the jdbc.input.query;
 *  this is where the column range comes in. */
public static final String INPUT_BOUNDING_QUERY = "mapred.jdbc.input.bounding.query";

/** Class name implementing DBWritable which will hold input tuples;
 *  this is the value type of the input key/value pairs. */
public static final String INPUT_CLASS_PROPERTY = "mapreduce.jdbc.input.class";

/** Output table name */
public static final String OUTPUT_TABLE_NAME_PROPERTY = "mapreduce.jdbc.output.table.name";

/** Field names in the Output table */
public static final String OUTPUT_FIELD_NAMES_PROPERTY = "mapreduce.jdbc.output.field.names";

/** Number of fields in the Output table */
public static final String OUTPUT_FIELD_COUNT_PROPERTY = "mapreduce.jdbc.output.field.count";
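A minimal driver sketch showing how these knobs are typically set; the driver class, URL, credentials, table and field names are placeholders, and UserRecord is the hypothetical DBWritable from above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;

public class DBImportDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Sets mapreduce.jdbc.driver.class, .url, .username and .password.
    DBConfiguration.configureDB(conf,
        "com.mysql.jdbc.Driver",            // placeholder driver
        "jdbc:mysql://dbhost:3306/mydb",    // placeholder URL
        "user", "password");                // placeholder credentials

    Job job = Job.getInstance(conf, "db-import");
    job.setJarByClass(DBImportDriver.class);

    // Table form of setInput: value class, table name, WHERE conditions,
    // ORDER BY column, then the field names to select. This also sets
    // DBInputFormat as the job's input format.
    DBInputFormat.setInput(job, UserRecord.class,
        "users", null, "id", "id", "name");

    // ... set mapper, output format and output path here ...
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}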
A caveat: with the number of maps set to mapnum, an import from the relational database triggers mapnum separate ORDER BY sorts on the database side. For large imports this drives database load very high, and both that load and the network between the database and the Hadoop cluster can become the bottleneck. Compared with first dumping the data to a file with the database's own export tool and then loading the file into HDFS, the native export tools are generally more efficient, put less load on the database, and require no MapReduce resources on the cluster, so the overall import may well be faster.
Two possible improvements: 1. partition the table in the database, by range or by hash; 2. subclass DBInputFormat so that the number of maps equals the number of partitions and the sort is dropped entirely, with each map reading exactly one partition. This can buy a substantial speedup and noticeably lighten the load on the database; a hedged sketch of such a subclass follows.
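A sketch of idea 2, under the assumption that the table has named partitions and that a custom property (db.input.partitions, invented here) lists them. A matching DBRecordReader subclass, not shown, would build a per-partition query such as MySQL's SELECT ... FROM t PARTITION (p0), with no ORDER BY and no row-range predicate:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

public class PartitionedDBInputFormat<T extends DBWritable>
    extends DBInputFormat<T> {

  /** Hypothetical property naming the table partitions, e.g. "p0,p1,p2". */
  public static final String PARTITIONS_PROPERTY = "db.input.partitions";

  /** A split that identifies one database partition. */
  public static class PartitionSplit extends InputSplit implements Writable {
    private String partition;

    public PartitionSplit() { }    // needed for deserialization
    public PartitionSplit(String partition) { this.partition = partition; }

    public String getPartition() { return partition; }

    @Override public long getLength() { return 0; }   // row count unknown
    @Override public String[] getLocations() { return new String[0]; }

    @Override public void write(DataOutput out) throws IOException {
      Text.writeString(out, partition);
    }
    @Override public void readFields(DataInput in) throws IOException {
      partition = Text.readString(in);
    }
  }

  // One split per partition: no COUNT query, no sort, and each map scans
  // exactly one partition.
  @Override
  public List<InputSplit> getSplits(JobContext job) throws IOException {
    String[] partitions =
        job.getConfiguration().getStrings(PARTITIONS_PROPERTY);
    if (partitions == null) {
      throw new IOException(PARTITIONS_PROPERTY + " is not set");
    }
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (String p : partitions) {
      splits.add(new PartitionSplit(p));
    }
    return splits;
  }
}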