Calcite-[1]-Tutorial

最新推荐文章于 2024-05-07 14:49:40 发布

hjw199089

最新推荐文章于 2024-05-07 14:49:40 发布

阅读量1k

点赞数

分类专栏： [22]Calcite 文章标签： Calcite

[22]Calcite 专栏收录该内容

10 篇文章 0 订阅

订阅专栏

其他参考：

http://blog.csdn.net/yu616568/article/details/49915577

http://www.infoq.com/cn/articles/new-big-data-hadoop-query-engine-apache-calcite

原文

http://calcite.apache.org/docs/

Apache Calcite 动态数据管理框架、典型的数据库管理系统（不包含数据存储、处理数据的复杂算法、存储数据的仓库）

意在承担存储、处理数据的任务

是应用和多种数据存储落地、数据处理引擎间的优秀中间媒介。

一步步演示如何通过一个简单的adapter将CSV文件目录映射成一个包含tables的schema，然后提供SQL查询接口。

concepts

user-defined schema using SchemaFactory and Schema interfaces;
declaring schemas in a model JSON file;
declaring views in a model JSON file;
user-defined table using the Table interface;
determining the record type of a table;
a simple implementation of Table, using theScannableTable interface, that enumerates all rows directly;
a more advanced implementation that implements FilterableTable, and can filter out rows according to simple predicates;
advanced implementation of Table, using TranslatableTable, that translates to relational operators using planner rules.

Download and build

$ git clone https://github.com/apache/calcite.git
$ cd calcite
$ mvn install -DskipTests -Dcheckstyle.skip=true
$ cd example/csv

通过工程自带的 SQL shell sqlline连接Calcite

$ ./sqlline

sqlline> !connect jdbc:calcite:model=target/test-classes/model.json admin admin

执行 metadata query（类似的命令还有!columns 和 !describe）:

sqlline> !tables
+------------+--------------+-------------+---------------+----------+------+
| TABLE_CAT  | TABLE_SCHEM  | TABLE_NAME  |  TABLE_TYPE   | REMARKS  | TYPE |
+------------+--------------+-------------+---------------+----------+------+
| null       | SALES        | DEPTS       | TABLE         | null     | null |
| null       | SALES        | EMPS        | TABLE         | null     | null |
| null       | SALES        | HOBBIES     | TABLE         | null     | null |
| null       | metadata     | COLUMNS     | SYSTEM_TABLE  | null     | null |
| null       | metadata     | TABLES      | SYSTEM_TABLE  | null     | null |
+------------+--------------+-------------+---------------+----------+------+

其中

EMPS DEPTS tables 是基于 target/test-classes目录下的 EMPS.csv 和 DEPTS.csv

【1】体会下扫表查询

sqlline> SELECT * FROM emps;

【2】Now JOIN and GROUP BY:
sqlline> SELECT d.name, COUNT(*)
. . . .> FROM emps AS e JOIN depts AS d ON e.deptno = d.deptno
. . . .> GROUP BY d.name;

【3】VALUES operator ，expressions and SQL built-in functions:

sqlline> VALUES CHAR_LENGTH('Hello, ' || 'world!');
+---------+
| EXPR$0 |
+---------+
| 13 |
+---------+

Schema discovery

作为一个 “database without a storage layer”，Calcite不感知任何文件格式,Calcite是通过我们提供的calcite-example-csv project知道这些tables的。

主要步骤：

1.根据model 文件中的一个 schema factory class 定义一个schema

2. schema factory 构建一个schema，schema创建一些表，每个table知道怎么扫描 CSV file得到数据

3.Calcite解析query，planned 这个query来使用这些表，执行query Calcite invokes 这些表去读取数据

通过JDBC连接串我们得到modelJSON格式的path，下面是这个model：

{
  version: '1.0',
  defaultSchema: 'SALES',
  schemas: [
    {
      name: 'SALES',
      type: 'custom',
      factory: 'org.apache.calcite.adapter.csv.CsvSchemaFactory',
      operand: {
        directory: 'target/test-classes/sales'
      }
    }
  ]
}

上模型定义了一个single schema “SALES”，这个schema powered by 这个plugin class（org.apache.calcite.adapter.csv.CsvSchemaFactory，这是 calcite-example-csv project的一部分），CsvSchemaFactory实现了SchemaFactory，它的create 方法传入mode文件中的

directory参数构建一个schema实例

public Schema create(SchemaPlus parentSchema, String name,
    Map<String, Object> operand) {
  String directory = (String) operand.get("directory");
  String flavorName = (String) operand.get("flavor");
  CsvTable.Flavor flavor;
  if (flavorName == null) {
    flavor = CsvTable.Flavor.SCANNABLE;
  } else {
    flavor = CsvTable.Flavor.valueOf(flavorName.toUpperCase());
  }
  return new CsvSchema(
      new File(directory),
      flavor);
}

继承自model ，schema factory 实例化一个schema （‘SALES’），一个实现了interface Schema的 org.apache.calcite.adapter.csv.CsvSchema
schema 主要用于产生一系列的tables（还包括sub-schemas， table-functions等高级特性），这些tables 实现了Table interface，上例中CsvSchema产生CsvTable及其子类。

CsvSchema, 重写getTableMap()建表。CsvSchema扫描整个目录directory，找到 “.csv”结尾的文件，如 EMPS.csv and DEPTS.csv，构建 tables EMPS and DEPTS.

Tables and views in schemas

model中我们不需要定义任何表，schema会自动的构建这些表

同时，除了动的构建的表，可通过schema中的tables属性定义额外的表

例如，下面展示如何构建一种非常重要有用的table，名曰view（看似一张表，不存储数据），下面是一个定义view的schema

type: 'view' 定义FEMALE_EMPS为view
当sql语句长，可以用list标识

type: 'view' 定义FEMALE_EMPS为view
当sql语句长，可以用list标识
{
  version: '1.0',
  defaultSchema: 'SALES',
  schemas: [
    {
      name: 'SALES',
      type: 'custom',
      factory: 'org.apache.calcite.adapter.csv.CsvSchemaFactory',
      operand: {
        directory: 'target/test-classes/sales'
      },
      tables: [
        {
          name: 'FEMALE_EMPS',
          type: 'view',
          sql: 'SELECT * FROM emps WHERE gender = \'F\''
        }
      ]
    }
  ]
}
==========
{
  name: 'FEMALE_EMPS',
  type: 'view',
  sql: [
    'SELECT * FROM emps',
    'WHERE gender = \'F\''
  ]
}
 
sqlline> SELECT e.name, d.name FROM female_emps AS e JOIN depts AS d on e.deptno = d.deptno;
+--------+------------+
|  NAME  |    NAME    |
+--------+------------+
| Wilma  | Marketing  |
+--------+------------+

Custom tables

Custom tables是用户自定义代码实现的，而不是用 custom schema定义。

下面是一个以 model-with-custom-table.json标识的Custom tables

{
  version: '1.0',
  defaultSchema: 'CUSTOM_TABLE',
  schemas: [
    {
      name: 'CUSTOM_TABLE',
      tables: [
        {
          name: 'EMPS',
          type: 'custom',
          factory: 'org.apache.calcite.adapter.csv.CsvTableFactory',
          operand: {
            file: 'target/test-classes/sales/EMPS.csv.gz',
            flavor: "scannable"
          }
        }
      ]
    }
  ]
}
==
sqlline> !connect jdbc:calcite:model=target/test-classes/model-with-custom-table.json admin admin
sqlline> SELECT empno, name FROM custom_table.emps;
+--------+--------+
| EMPNO  |  NAME  |
+--------+--------+
| 100    | Fred   |
| 110    | Eric   |
| 110    | John   |
| 120    | Wilma  |
| 130    | Alice  |
+--------+--------+

这是一个常规的schema，包含一个custom table （powered by org.apache.calcite.adapter.csv.CsvTableFactory，实现TableFactory），create method 实例化一个CsvScannableTable，将model file传入

public CsvTable create(SchemaPlus schema, String name,
    Map<String, Object> map, RelDataType rowType) {
  String fileName = (String) map.get("file");
  final File file = new File(fileName);
  final RelProtoDataType protoRowType =
      rowType != null ? RelDataTypeImpl.proto(rowType) : null;
  return new CsvScannableTable(file, protoRowType);
}

Optimizing queries using planner rules

Planner rules 优化查询

sqlline> !connect jdbc:calcite:model=target/test-classes/model.json admin admin
sqlline> explain plan for select name from emps;
+-----------------------------------------------------+
| PLAN                                                |
+-----------------------------------------------------+
| EnumerableCalcRel(expr#0..9=[{inputs}], NAME=[$t1]) |
|   EnumerableTableScan(table=[[SALES, EMPS]])        |
+-----------------------------------------------------+
sqlline> !connect jdbc:calcite:model=target/test-classes/smart.json admin admin
sqlline> explain plan for select name from emps;
+-----------------------------------------------------+
| PLAN                                                |
+-----------------------------------------------------+
| EnumerableCalcRel(expr#0..9=[{inputs}], NAME=[$t1]) |
|   CsvTableScan(table=[[SALES, EMPS]])               |
+-----------------------------------------------------+

在 smart.json model 文件中多出如下一行:
flavor: "translatable"
这指示CsvSchema建立CsvTranslatableTable，而不是CsvScannableTable，CsvTranslatableTable实现TranslatableTable.toRel() method 建立CsvTableScan，它作为叶子（query operator tree）一般会实现EnumerableTableScan。
下面是entirety的rules，constructor中声明了触发规则的相关expressions，onMatch method 生成一个relational expression 调用RelOptRuleCall.transformTo()指示rule触发生效

public class CsvProjectTableScanRule extends RelOptRule {
  public static final CsvProjectTableScanRule INSTANCE =
      new CsvProjectTableScanRule();

  private CsvProjectTableScanRule() {
    super(
        operand(Project.class,
            operand(CsvTableScan.class, none())),
        "CsvProjectTableScanRule");
  }

  @Override
  public void onMatch(RelOptRuleCall call) {
    final Project project = call.rel(0);
    final CsvTableScan scan = call.rel(1);
    int[] fields = getProjectFields(project.getProjects());
    if (fields == null) {
      // Project contains expressions more complex than just field references.
      return;
    }
    call.transformTo(
        new CsvTableScan(
            scan.getCluster(),
            scan.getTable(),
            scan.csvTable,
            fields));
  }

  private int[] getProjectFields(List<RexNode> exps) {
    final int[] fields = new int[exps.size()];
    for (int i = 0; i < exps.size(); i++) {
      final RexNode exp = exps.get(i);
      if (exp instanceof RexInputRef) {
        fields[i] = ((RexInputRef) exp).getIndex();
      } else {
        return null; // not a simple projection
      }
    }
    return fields;
  }
}

JDBC adapter

JDBC adapter将JDBC data source 中的schema映射为Calcite schema.

例如 schema从MySQL “foodmart” database中读信息：

{
  version: '1.0',
  defaultSchema: 'FOODMART',
  schemas: [
    {
      name: 'FOODMART',
      type: 'custom',
      factory: 'org.apache.calcite.adapter.jdbc.JdbcSchema$Factory',
      operand: {
        jdbcDriver: 'com.mysql.jdbc.Driver',
        jdbcUrl: 'jdbc:mysql://localhost/foodmart',
        jdbcUser: 'foodmart',
        jdbcPassword: 'foodmart'
      }
    }
  ]
}

Current limitations:
JDBC adapter 只能下推table scan 操作，其他processing操作(filtering, joins, aggregations and so forth) 在Calcite中执行，我们的goal是尽可能下推processing

The cloning JDBC adapter

cloning JDBC adapter构建一个混合数据database，先从 JDBC database中读取数据，然后构建in-memory tables。后面的查询都是从in-memory tables中进行
例如：

{
  version: '1.0',
  defaultSchema: 'FOODMART_CLONE',
  schemas: [
    {
      name: 'FOODMART_CLONE',
      type: 'custom',
      factory: 'org.apache.calcite.adapter.clone.CloneSchema$Factory',
      operand: {
        jdbcDriver: 'com.mysql.jdbc.Driver',
        jdbcUrl: 'jdbc:mysql://localhost/foodmart',
        jdbcUser: 'foodmart',
        jdbcPassword: 'foodmart'
      }
    }
  ]
}

此外，以一个存在的schema构建一个clone schema，以source property 引用model中以定义的schema：

{
  version: '1.0',
  defaultSchema: 'FOODMART_CLONE',
  schemas: [
    {
      name: 'FOODMART',
      type: 'custom',
      factory: 'org.apache.calcite.adapter.jdbc.JdbcSchema$Factory',
      operand: {
        jdbcDriver: 'com.mysql.jdbc.Driver',
        jdbcUrl: 'jdbc:mysql://localhost/foodmart',
        jdbcUser: 'foodmart',
        jdbcPassword: 'foodmart'
      }
    },
    {
      name: 'FOODMART_CLONE',
      type: 'custom',
      factory: 'org.apache.calcite.adapter.clone.CloneSchema$Factory',
      operand: {
        source: 'FOODMART'
      }
    }
  ]
}

hjw199089

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
Calcite-[1]-Tutorial

原文http://calcite.apache.org/docs/Apache Calcite 动态数据管理框架、典型的数据库管理系统（不包含数据存储、处理数据的复杂算法、存储数据的仓库）意在承担存储、处理数据的任务是应用和多种数据存储落地、数据处理引擎间的优秀中间媒介conceptsuser-defined schema using SchemaF
复制链接

扫一扫

专栏目录