
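Collecting the data

A bit of background

With so much literature about Spark available these days, it seems I need to spend more time talking about the old lady called Informix®. Informix was born in 1980 as Relational Database Systems (RDS) and quickly became a reference relational database management system (RDBMS) on UNIX systems. IBM acquired the Informix company in two phases (2001 and 2005), adding the eponymous database to its impressive data management portfolio. Fortunately, some would say, thanks to a highly loyal and motivated user base and user groups, Informix has stayed active in the IBM portfolio and has brought many innovations to the product, such as XML and JSON support and NoSQL integration, making it the first enterprise hybrid database. Informix is an excellent database for transactional applications and is still in use at companies such as Walmart, Cisco, Home Depot, and DHL. Each time you buy something at the world's largest supermarket chain or at your favorite orange-themed home improvement store, the transaction is recorded in one of the Informix databases at each location. On a regular basis, this data is brought back to Arkansas and Georgia for consolidation.

Although in-memory support has been added in recent years through Informix Warehouse Accelerator (IWA), the world is moving toward heterogeneous support and data-lake-oriented architectures. This is the sweet spot for Apache Spark, and it may leave you wondering: "How do I offload the data in my Informix database into Spark?"

In this tutorial, you learn how to collect data from Informix. In the second part of this series, you learn how to add other data sources and analyze the data.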

To follow along, you need:

Spark 2.1.1
Informix 12.10.FC8
Java™ 1.8.0_60-b27
Informix JDBC driver 4.10.8.1
MacOS Sierra 10.12.5

Note: All the code is available on GitHub.

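Getting the customers into my dataframe

Figure 1. Adding the customers to a dataframe

In this first section, you connect to the customer table of stores_demo, the well-known sample database.

The syntax is straightforward, but there is one catch: remember to set DELIMIDENT=Y in the connection URL so that the SQL queries are built correctly. With DELIMIDENT set, Informix treats double-quoted strings as delimited identifiers, which is how Spark quotes column names in the queries it generates.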
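The connection URL carries the database name followed by a colon and a list of KEY=VALUE properties separated by semicolons, as the URL in the excerpt that follows shows. As a minimal sketch, assuming a hypothetical helper named InformixUrl (it is not part of the tutorial's repository), you could assemble such a URL from its parts so that DELIMIDENT=Y is harder to forget:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper; not part of the tutorial's repository. */
final class InformixUrl {

  private InformixUrl() {
  }

  /**
   * Assembles a URL shaped like the one used in this tutorial:
   * jdbc:informix-sqli://host:port/database:KEY=VALUE;KEY=VALUE
   */
  static String build(String host, int port, String database,
      Map<String, String> properties) {
    String url = "jdbc:informix-sqli://" + host + ":" + port + "/"
        + database;
    if (properties.isEmpty()) {
      return url;
    }
    StringJoiner props = new StringJoiner(";");
    for (Map.Entry<String, String> e : properties.entrySet()) {
      props.add(e.getKey() + "=" + e.getValue());
    }
    return url + ":" + props;
  }

  public static void main(String[] args) {
    // LinkedHashMap keeps the properties in insertion order.
    Map<String, String> props = new LinkedHashMap<>();
    props.put("IFXHOST", "lo_informix1210");
    props.put("DELIMIDENT", "Y");
    // Prints the same URL as the one passed to Spark in this tutorial.
    System.out.println(build("[::1]", 33378, "stores_demo", props));
  }
}
```

The host, port, and IFXHOST values are the ones from this tutorial's environment; replace them with your own.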

Excerpt from BasicCustomerLoader.java.

First, create a Spark session by connecting in local mode (or to your cluster).

SparkSession spark = SparkSession
      .builder()
      .appName("Stores Customer")
      .master("local")
      .getOrCreate();

Read the data from the table and store it in a dataframe.

Dataset<Row> df = spark
    .read()
    .format("jdbc")
    .option(
      "url",
      "jdbc:informix-sqli://[::1]:33378/stores_demo:IFXHOST=lo_informix1210;DELIMIDENT=Y")
    .option("dbtable", "customer")
    .option("user", "informix")
    .option("password", "in4mix")
    .load();
  df.printSchema();
  df.show(5);

You get the schema.

 |-- customer_num: long (nullable = false)
 |-- fname: string (nullable = true)
 |-- lname: string (nullable = true)
 |-- company: string (nullable = true)
 |-- address1: string (nullable = true)
 |-- address2: string (nullable = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- zipcode: string (nullable = true)
 |-- phone: string (nullable = true)

Then the data (only the first five rows are shown):

+------------+---------------+---------------+--------------------+-------...
|customer_num|          fname|          lname|             company|       ...
+------------+---------------+---------------+--------------------+-------...
|         101|Ludwig         |Pauli          |All Sports Supplies |213 Ers…
|         102|Carole         |Sadler         |Sports Spot         |785 Gea…
|         103|Philip         |Currie         |Phil's Sports       |654 Pop…
|         104|Anthony        |Higgins        |Play Ball!          |East Sh…
|         105|Raymond        |Vector         |Los Altos Sports    |1899 La…
+------------+---------------+---------------+--------------------+-------...

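Dumping the whole database into Spark

Figure 2. The whole database in Spark

The JDBC metadata API lists all the tables, and you then load them one by one into Spark. Only tables and views are loaded; system tables, synonyms, aliases, and so on are rejected.

However, when importing the data, Spark does not recognize some of the opaque data types that exist in Informix. To avoid this, you need to create a specific dialect for Informix. A JDBC dialect helps Spark as data comes in (and goes out), as well as when setting some parameters.

Building the translation dialect

Excerpt from InformixJdbcDialect.java in the net.jgp.labs.informix2spark.utils package.

You need a few packages from Spark and Scala.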

import org.apache.spark.sql.jdbc.JdbcDialect;
import org.apache.spark.sql.jdbc.JdbcType;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.MetadataBuilder;

import scala.Option;

The dialect class inherits from JdbcDialect, but you do not override all of its methods.

public class InformixJdbcDialect extends JdbcDialect {

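The main method is canHandle, which determines, based on the JDBC URL, whether this is the right dialect to use. In this case, you check whether the URL starts with jdbc:informix-sqli, which is a good indication that you are working with an Informix database.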

@Override
  public boolean canHandle(String url) {
    return url.startsWith("jdbc:informix-sqli");
  }

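The second method is getCatalystType, which returns the data type that Catalyst understands, based on the data type retrieved from the JDBC driver. This list contains all the data types in stores_demo. If your application uses more data types, you must add them here.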

@Override
  public Option<DataType> getCatalystType(int sqlType,
      String typeName, int size, MetadataBuilder md) {
    if (typeName.toLowerCase().compareTo("calendar") == 0) {
      return Option.apply(DataTypes.BinaryType);
    }
    if (typeName.toLowerCase().compareTo(
        "calendarpattern") == 0) {
      return Option.apply(DataTypes.BinaryType);
    }
    if (typeName.toLowerCase().compareTo(
        "se_metadata") == 0) {
      return Option.apply(DataTypes.BinaryType);
    }
    if (typeName.toLowerCase().compareTo(
        "sysbldsqltext") == 0) {
      return Option.apply(DataTypes.BinaryType);
    }
    if (typeName.toLowerCase().startsWith("timeseries")) {
      return Option.apply(DataTypes.BinaryType);
    }
    if (typeName.toLowerCase().compareTo("st_point") == 0) {
      return Option.apply(DataTypes.BinaryType);
    }
    if (typeName.toLowerCase().compareTo("tspartitiondesc_t") == 0) {
      return Option.apply(DataTypes.BinaryType);
    }
    return Option.empty();
  }

Note that this method returns an Option, which comes from Scala.
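The chain of if statements above boils down to a lookup: a handful of type names matched exactly, plus one prefix match for timeseries. The sketch below shows that lookup in a table-driven form that runs without Spark. It is an illustration only: OpaqueTypeLookup is a hypothetical class, java.util.Optional stands in for scala.Option, and the string "binary" stands in for DataTypes.BinaryType.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

/** Hypothetical, Spark-free illustration of the lookup in getCatalystType. */
final class OpaqueTypeLookup {

  // Type names matched exactly, copied from the dialect above.
  private static final Set<String> EXACT = new HashSet<>(Arrays.asList(
      "calendar", "calendarpattern", "se_metadata", "sysbldsqltext",
      "st_point", "tspartitiondesc_t"));

  private OpaqueTypeLookup() {
  }

  /**
   * Returns "binary" for the opaque types handled by the dialect and an
   * empty Optional for everything else, mirroring Option.apply(...) and
   * Option.empty() in the real method.
   */
  static Optional<String> lookup(String typeName) {
    String name = typeName.toLowerCase(Locale.ROOT);
    if (EXACT.contains(name) || name.startsWith("timeseries")) {
      return Optional.of("binary");
    }
    return Optional.empty();
  }
}
```

With this shape, supporting another opaque type means adding one entry to the set. In the real dialect, as far as I understand the JdbcDialect contract, an empty Option tells Spark to fall back on its default type mapping.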

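Getting the list of tables

It is now time to go through all the tables.

Excerpt from DatabaseLoader.java in the net.jgp.labs.informix2spark.l100 package. For readability, I removed the exception handling.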

import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.jdbc.JdbcDialect;
import org.apache.spark.sql.jdbc.JdbcDialects;

import net.jgp.labs.informix2spark.utils.Config;
import net.jgp.labs.informix2spark.utils.ConfigManager;
import net.jgp.labs.informix2spark.utils.InformixJdbcDialect;
import net.jgp.labs.informix2spark.utils.K;

public class DatabaseLoader {

  private List<String> getTables(Connection connection) {
    List<String> tables = new ArrayList<>();

Get the connection's metadata.

DatabaseMetaData md;
    md = connection.getMetaData();

From there, query the tables. This syntax returns all the tables.

ResultSet rs;
    rs = md.getTables(null, null, "%", null);

You can now browse the metadata result set like any other result set.

while (rs.next()) {

The table name is the third column.

String tableName = rs.getString(3);

The table type is the fourth column.

String tableType = rs.getString(4).toLowerCase();
      System.out.print("Table [" + tableName + "] ... ");

Keep only the tables and views. The other types are system table, global temporary, local temporary, alias, and synonym.

if (tableType.compareTo("table") == 0
          || tableType.compareTo("view") == 0) {
        tables.add(tableName);
        System.out.println("is in (" + tableType + ").");
      } else {
        System.out.println("is out (" + tableType + ").");
      }
    }

    return tables;
  }
}
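The excerpt above leaves out the exception handling. As a rough sketch of what a complete version might look like (TableLister and isLoadable are hypothetical names, and the repository's real code may handle errors differently), the variant below closes the result set with try-with-resources, reads the columns by their standard JDBC names instead of by position, and tolerates a null table type:

```java
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical variant of getTables with its exception handling left in. */
final class TableLister {

  private TableLister() {
  }

  /** Keeps only the two table types this tutorial loads. */
  static boolean isLoadable(String tableType) {
    // equalsIgnoreCase on the literal is safe when tableType is null.
    return "table".equalsIgnoreCase(tableType)
        || "view".equalsIgnoreCase(tableType);
  }

  /** Lists tables and views; returns what it has if the query fails. */
  static List<String> listTables(Connection connection) {
    List<String> tables = new ArrayList<>();
    try {
      DatabaseMetaData md = connection.getMetaData();
      // try-with-resources closes the result set even on error.
      try (ResultSet rs = md.getTables(null, null, "%", null)) {
        while (rs.next()) {
          // TABLE_NAME and TABLE_TYPE are columns 3 and 4.
          String name = rs.getString("TABLE_NAME");
          String type = rs.getString("TABLE_TYPE");
          if (isLoadable(type)) {
            tables.add(name);
          }
        }
      }
    } catch (SQLException e) {
      System.err.println("Could not list tables: " + e.getMessage());
    }
    return tables;
  }
}
```

Keeping the filter in its own small method makes it possible to check the rule without a database connection.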

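Consuming CPU cycles and memory

Now that you have the list of tables you want, you can put all the pieces together and create a map of dataframes.

Excerpt from DatabaseLoader.java in the net.jgp.labs.informix2spark.l100 package. For readability, I removed the exception handling and some condition tests.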

private void start() {

Connect to Spark in local mode.

SparkSession spark = SparkSession
        .builder()
        .appName("Stores Data")
        .master("local")
        .getOrCreate();

This small configuration object, obtained through ConfigManager, simplifies the management of the connection.

Config config = ConfigManager.getConfig(K.INFORMIX);
    Connection connection = config.getConnection();

Get all the tables.

List<String> tables = getTables(connection);
    if (tables.isEmpty()) {
      return;
    }

Define the dialect and register it with Spark.

JdbcDialect dialect = new InformixJdbcDialect();
    JdbcDialects.registerDialect(dialect);

Prepare your map. The map is keyed by table name, and each value is the corresponding dataframe.

Map<String, Dataset<Row>> database = new HashMap<>();

Go through all the tables.

for (String table : tables) {
      System.out.print("Loading table [" + table
          + "] ... ");

Follow the same principle as before, just with a different table each time.

Dataset<Row> df = spark
          .read()
          .format("jdbc")
          .option("url", config.getJdbcUrl())
          .option("dbtable", table)
          .option("user", config.getUser())
          .option("password", config.getPassword())
          .option("driver", config.getDriver())
          .load();
      database.put(table, df);
      System.out.println("done");
    }

Pick a random table (state is used here), analyze it, and print the first five rows.

System.out.println("We have " + database.size()
        + " table(s) in our database");
    Dataset<Row> df = database.get("state");

    df.printSchema();
    System.out.println("Number of rows in state: " + df
        .count());
    df.show(5);
  }
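Because the exception handling was removed from the excerpt, it does not show what happens when one table fails to load. One possible approach, sketched here with a hypothetical TolerantLoader class that is not taken from the repository, is to catch the failure per table and carry on with the rest. The loader is passed in as a function, so in the tutorial it would wrap the spark.read()...load() chain; in this sketch it can be any function of the table name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical helper: loads every table and remembers the failures. */
final class TolerantLoader<T> {

  private final Map<String, T> loaded = new LinkedHashMap<>();
  private final List<String> failed = new ArrayList<>();

  /** Calls the loader once per table name, in list order. */
  void loadAll(List<String> tables, Function<String, T> loader) {
    for (String table : tables) {
      try {
        loaded.put(table, loader.apply(table));
      } catch (Exception e) {
        // Catching Exception is deliberate: Scala code does not declare
        // checked exceptions, so anything may surface from the loader.
        // One bad table should not stop the whole import.
        failed.add(table);
      }
    }
  }

  /** Tables that loaded, keyed by table name. */
  Map<String, T> loaded() {
    return loaded;
  }

  /** Names of the tables whose loader threw an exception. */
  List<String> failed() {
    return failed;
  }
}
```

After the loop, failed() tells you which tables to investigate, for example because they contain a data type the dialect does not cover yet.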

Run the program to get the following output.

Table [sysaggregates] ... is out (system table).
Table [sysams] ... is out (system table).
…
Table [call_type] ... is in (table).
Table [catalog] ... is in (table).
Table [classes] ... is in (table).
Table [cust_calls] ... is in (table).
Table [customer] ... is in (table).
Table [customer_ts_data] ... is in (table).
Table [employee] ... is in (table).
Table [ext_customer] ... is in (table).
Table [items] ... is in (table).
Table [manufact] ... is in (table).
Table [orders] ... is in (table).
Table [se_metadatatable] ... is in (table).
Table [se_views] ... is in (table).
…
Loading table [customer] ... done
Loading table [customer_ts_data] ... done
Loading table [employee] ... done
Loading table [ext_customer] ... done
Loading table [items] ... done
Loading table [manufact] ... done
Loading table [orders] ... done
Loading table [se_metadatatable] ... done
Loading table [se_views] ... done
Loading table [state] ... done
Loading table [stock] ... done
…
We have 45 table(s) in our database

root
 |-- code: string (nullable = true)
 |-- sname: string (nullable = true)

Only the first five rows of the state table are shown:

+----+---------------+
|code|          sname|
+----+---------------+
|  AK|Alaska         |
|  HI|Hawaii         |
|  CA|California     |
|  OR|Oregon         |
|  WA|Washington     |
+----+---------------+
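The walkthrough relies on the Config and ConfigManager helpers from the utils package, whose source is not shown in this tutorial. If you want to try the code without them, a hypothetical stand-in could be as small as the class below. The getter names match the calls used above; everything else, including the class name SimpleConfig, is an assumption.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

/** Hypothetical stand-in for the Config class used in this tutorial. */
final class SimpleConfig {

  private final String jdbcUrl;
  private final String user;
  private final String password;
  private final String driver;

  SimpleConfig(String jdbcUrl, String user, String password,
      String driver) {
    this.jdbcUrl = jdbcUrl;
    this.user = user;
    this.password = password;
    this.driver = driver;
  }

  String getJdbcUrl() {
    return jdbcUrl;
  }

  String getUser() {
    return user;
  }

  String getPassword() {
    return password;
  }

  /** Fully qualified class name of the JDBC driver. */
  String getDriver() {
    return driver;
  }

  /** Opens a plain JDBC connection; the driver must be on the classpath. */
  Connection getConnection() throws SQLException {
    return DriverManager.getConnection(jdbcUrl, user, password);
  }
}
```

For Informix, the driver value would be com.informix.jdbc.IfxDriver, and the URL, user, and password would be the ones from the first example in this tutorial.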

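Moving forward

You can now run analytics on this data, but that is for another lesson.

In this tutorial, we used standard Java code and standard JDBC methods. We designed this code for Informix, but adapting it to another great database, such as Db2®, takes only a few minutes.

Thanks to Pradeep Natarajan, now with HCL but still with Informix, who is always there when I have odd questions.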

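Going further

To install and better understand Informix on a Mac, read "Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the Apple, the Coffee, and a Great Database" (shameless self-plug; I wrote it).
Download all the code for this tutorial from my GitHub repository: How to transfer IBM Informix data to Apache Spark using JDBC. Don't forget to like and fork!
Learn more about the DatabaseMetaData interface and JDBC DatabaseMetaData.
To learn more about data lakes and their architecture, read Ben Sharma's free ebook, "Architecting Data Lakes".
Get your project started with the data store journeys.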

Translated from: https://www.ibm.com/developerworks/opensource/library/ba-offloading-informix-data-spark/index.html
