Row order in HIVE differs from DB2 after a SQOOP import: problem analysis


奋斗鸭，鸭子的鸭


Category: SQOOP source code modification notes

Article link: https://blog.csdn.net/chenwei945/article/details/81094868


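In a project that used SQOOP to migrate data from DB2 to HIVE, the row order turned out to be inconsistent. The cause lies in SQOOP's importTable process: when the `--split-by` parameter is not specified, SQOOP defaults to ordering by the first primary-key column of the DB2 table, whereas DB2 itself orders rows by the composite primary key. As a result, the row order changes after migration. The fix is to specify `--split-by` explicitly with the field that corresponds to the DB2 composite primary key.

(Summary generated automatically by CSDN.)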
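To make the fix concrete, here is a minimal sketch that assembles the arguments of a `sqoop import` run with an explicit `--split-by`. The JDBC URL, table name, and column name are placeholders, and the helper class `SqoopImportArgs` is hypothetical; only the Sqoop flags themselves (`--connect`, `--table`, `--split-by`, `--num-mappers`, `--hive-import`) are real. Credentials are left out, and which column to pass depends on your table.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: builds the argument list for a sqoop import with an explicit split column. */
class SqoopImportArgs {

    // jdbcUrl, table and splitColumn are placeholders supplied by the caller.
    static List<String> build(String jdbcUrl, String table, String splitColumn, int mappers) {
        List<String> args = new ArrayList<>();
        args.add("import");
        args.add("--connect");
        args.add(jdbcUrl);
        args.add("--table");
        args.add(table);
        // Name the split column explicitly instead of relying on the default.
        args.add("--split-by");
        args.add(splitColumn);
        args.add("--num-mappers");
        args.add(String.valueOf(mappers));
        args.add("--hive-import");
        return args;
    }

    public static void main(String[] argv) {
        // Example values only; replace with your own connection and table.
        List<String> args = build("jdbc:db2://db2host:50000/SAMPLE", "ORDERS", "ORDER_ID", 4);
        System.out.println("sqoop " + String.join(" ", args));
    }
}
```

The printed line can be pasted into a shell once real connection details and credentials are filled in.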

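Recently, while migrating data with SQOOP in a project (using --table mode), I found that the row order in HIVE differed from the row order in DB2. Unable to work out why, I went straight to the SQOOP source code.

In the source, Sqoop's import code lives in org.apache.sqoop.tool.ImportTool.java.

The code that actually executes importTable is as follows.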

  protected boolean importTable(SqoopOptions options, String tableName,
      HiveImport hiveImport) throws IOException, ImportException {
    String jarFile = null;

    // Generate the ORM code for the tables.
    jarFile = codeGenerator.generateORM(options, tableName);

    Path outputPath = getOutputPath(options, tableName);

    // Do the actual import.
    ImportJobContext context = new ImportJobContext(tableName, jarFile,
        options, outputPath);

    // If we're doing an incremental import, set up the
    // filtering conditions used to get the latest records.
    if (!initIncrementalConstraints(options, context)) {
      return false;
    }

    if (options.isDeleteMode()) {
      deleteTargetDir(context);
    }

    if (null != tableName) {
      manager.importTable(context);
    } else {
      manager.importQuery(context);
    }

    if (options.isAppendMode()) {
      AppendUtils app = new AppendUtils(context);
      app.append();
    } else if (options.getIncrementalMode() == SqoopOptions.IncrementalMode.DateLastModified) {
      lastModifiedMerge(options, context);
    }

    // If the user wants this table to be in Hive, perform that post-load.
    if (options.doHiveImport()) {
      // For Parquet file, the import action will create hive table directly via
      // kite. So there is no need to do hive import as a post step again.
      if (options.getFileLayout() != SqoopOptions.FileLayout.ParquetFile) {
        hiveImport.importTable(tableName, options.getHiveTableName(), false);
      }
    }

    saveIncrementalState(options);

    return true;
  }

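Logic: 1. Generate the Java class file used for ORM. This class file is generated by ClassWriter and provides the read and write methods that match each column's type and name (not examined in depth here, since the problem is not in this step). This is a place where the data could be preprocessed, for example by trimming whitespace.

2. Enter manager.importTable(context); the purpose of this method is to generate the HDFS files. What actually gets called is org.apache.sqoop.manager.SqlManager.importTable();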
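Step 1 mentions that the generated record class is a place where values could be preprocessed, for example by trimming whitespace. The sketch below illustrates that idea only. `TrimmedRecordSketch` and its methods are hypothetical stand-ins, not what Sqoop's ClassWriter emits, and the real generated read and write methods have different signatures.

```java
/** Hypothetical stand-in for a generated record class; not Sqoop's real output. */
class TrimmedRecordSketch {
    private String name;   // e.g. a fixed-width CHAR column, blank-padded by the database
    private String code;

    // Strip trailing blanks only; a null value stays null.
    static String rtrim(String raw) {
        if (raw == null) {
            return null;
        }
        int end = raw.length();
        while (end > 0 && raw.charAt(end - 1) == ' ') {
            end--;
        }
        return raw.substring(0, end);
    }

    // Stand-in for the generated read method: takes raw values in column order.
    void readFields(String[] rawColumns) {
        this.name = rtrim(rawColumns[0]);
        this.code = rtrim(rawColumns[1]);
    }

    // Stand-in for the generated write method: emits one delimited line.
    String write(char delimiter) {
        return name + delimiter + code;
    }

    public static void main(String[] args) {
        TrimmedRecordSketch r = new TrimmedRecordSketch();
        r.readFields(new String[] {"ALICE     ", "A1  "});
        System.out.println(r.write(','));
    }
}
```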

The method in detail:

  /**
   * Default implementation of importTable() is to launch a MapReduce job
   * via DataDrivenImportJob to read the table with DataDrivenDBInputFormat.
   */
  public void importTable(com.cloudera.sqoop.manager.ImportJobContext context)
      throws IOException, ImportException {
    String tableName = context.getTableName();
    String jarFile = context.getJarFile();
    SqoopOptions opts = context.getOptions();

    context.setConnManager(this);

    ImportJobBase importer;
    if (opts.getHBaseTable() != null) {
      // Import to HBase.
      i
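The Javadoc in the excerpt above says the default importTable() launches a MapReduce job that reads the table through DataDrivenDBInputFormat. As a rough illustration of what range-based splitting on a single column looks like, the sketch below cuts a numeric key range into contiguous pieces and prints one WHERE clause per piece. It is a simplified approximation, not Sqoop's or Hadoop's actual splitter code; the class `RangeSplitSketch`, the column name `K1`, and the bounds are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified illustration of range splitting on one numeric column. */
class RangeSplitSketch {

    // Cut [min, max] into at most numSplits contiguous ranges, one WHERE clause per range.
    static List<String> whereClauses(String column, long min, long max, int numSplits) {
        List<String> clauses = new ArrayList<>();
        long span = max - min + 1;                                    // key values covered
        long step = Math.max(1, (span + numSplits - 1) / numSplits);  // ceiling division
        for (long lo = min; lo <= max; lo += step) {
            long hi = lo + step;                                      // exclusive upper bound
            if (hi > max) {
                // Last range: close it at max so the top value is included.
                clauses.add(column + " >= " + lo + " AND " + column + " <= " + max);
            } else {
                clauses.add(column + " >= " + lo + " AND " + column + " < " + hi);
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        // Hypothetical split column K1 with values between 1 and 100, four map tasks.
        for (String clause : whereClauses("K1", 1, 100, 4)) {
            System.out.println("SELECT ... FROM T WHERE " + clause);
        }
    }
}
```

Each range would be read by a separate map task and written to its own output file, so the order of rows in the target depends on the split column and on how each range is returned, rather than on the composite key order seen in DB2.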
