Calcite-[1]-Tutorial

其他参考:

http://blog.csdn.net/yu616568/article/details/49915577

http://www.infoq.com/cn/articles/new-big-data-hadoop-query-engine-apache-calcite

原文

http://calcite.apache.org/docs/

Apache Calcite 动态数据管理框架、典型的数据库管理系统(不包含数据存储、处理数据的复杂算法、存储数据的仓库)

意在承担存储、处理数据的任务

是应用和多种数据存储落地、数据处理引擎间的优秀中间媒介。


一步步演示如何通过一个简单的adapter将CSV文件目录映射成一个包含tables的schema,然后提供SQL查询接口。

concepts

  • user-defined schema using SchemaFactory and Schema interfaces;
  • declaring schemas in a model JSON file;
  • declaring views in a model JSON file;
  • user-defined table using the Table interface;
  • determining the record type of a table;
  • a simple implementation of Table, using theScannableTable interface, that enumerates all rows directly;
  • a more advanced implementation that implements FilterableTable, and can filter out rows according to simple predicates;
  • advanced implementation of Table, using TranslatableTable, that translates to relational operators using planner rules.

Download and build

$ git clone https://github.com/apache/calcite.git
$ cd calcite
$ mvn install -DskipTests -Dcheckstyle.skip=true
$ cd example/csv

通过工程自带的 SQL shell sqlline连接Calcite

./sqlline

sqlline> !connect jdbc:calcite:model=target/test-classes/model.json admin admin

执行 metadata query(类似的命令还有!columns 和 !describe):

sqlline> !tables
+------------+--------------+-------------+---------------+----------+------+
| TABLE_CAT  | TABLE_SCHEM  | TABLE_NAME  |  TABLE_TYPE   | REMARKS  | TYPE |
+------------+--------------+-------------+---------------+----------+------+
| null       | SALES        | DEPTS       | TABLE         | null     | null |
| null       | SALES        | EMPS        | TABLE         | null     | null |
| null       | SALES        | HOBBIES     | TABLE         | null     | null |
| null       | metadata     | COLUMNS     | SYSTEM_TABLE  | null     | null |
| null       | metadata     | TABLES      | SYSTEM_TABLE  | null     | null |
+------------+--------------+-------------+---------------+----------+------+
其中

EMPS  DEPTS tables   是基于 target/test-classes目录下的 EMPS.csv 和 DEPTS.csv 

【1】体会下扫表查询

sqlline> SELECT * FROM emps;


 

【2】Now JOIN and GROUP BY:
sqlline> SELECT d.name, COUNT(*)
. . . .> FROM emps AS e JOIN depts AS d ON e.deptno = d.deptno
. . . .> GROUP BY d.name;


【3】VALUES operator ,expressions and SQL built-in functions:

sqlline> VALUES CHAR_LENGTH('Hello, ' || 'world!');
+---------+
| EXPR$0 |
+---------+
| 13 |
+---------+

 

Schema discovery

作为一个 “database without a storage layer”,Calcite不感知任何文件格式,Calcite是通过我们提供的calcite-example-csv project知道这些tables的。

主要步骤:

1.根据model 文件中的一个 schema factory class 定义一个schema

2. schema factory 构建一个schema,schema创建一些表,每个table知道怎么扫描 CSV file得到数据

3.Calcite解析query,planned 这个query来使用这些表,执行query Calcite  invokes 这些表去读取数据

通过JDBC连接串我们得到modelJSON格式的path,下面是这个model:

{
  version: '1.0',
  defaultSchema: 'SALES',
  schemas: [
    {
      name: 'SALES',
      type: 'custom',
      factory: 'org.apache.calcite.adapter.csv.CsvSchemaFactory',
      operand: {
        directory: 'target/test-classes/sales'
      }
    }
  ]
}

上模型定义了一个single schema “SALES”,这个schema powered by 这个plugin class(org.apache.calcite.adapter.csv.CsvSchemaFactory,这是 calcite-example-csv project的一部分),CsvSchemaFactory实现了SchemaFactory,它的create 方法传入mode文件中的

directory参数构建一个schema实例
public Schema create(SchemaPlus parentSchema, String name,
    Map<String, Object> operand) {
  String directory = (String) operand.get("directory");
  String flavorName = (String) operand.get("flavor");
  CsvTable.Flavor flavor;
  if (flavorName == null) {
    flavor = CsvTable.Flavor.SCANNABLE;
  } else {
    flavor = CsvTable.Flavor.valueOf(flavorName.toUpperCase());
  }
  return new CsvSchema(
      new File(directory),
      flavor);
}
 

继承自model ,schema factory 实例化一个schema (‘SALES’),一个实现了interface Schema的 org.apache.calcite.adapter.csv.CsvSchema 
schema 主要用于产生一系列的tables(还包括sub-schemas, table-functions等高级特性),这些tables 实现了Table interface,上例中CsvSchema产生CsvTable及其子类。

CsvSchema, 重写getTableMap()建表。CsvSchema扫描整个目录directory,找到 “.csv”结尾的文件,如 EMPS.csv and DEPTS.csv,构建 tables EMPS and DEPTS.

 

Tables and views in schemas 

model中我们不需要定义任何表,schema会自动的构建这些表

同时,除了动的构建的表,可通过schema中的tables属性定义额外的表

例如,下面展示如何构建一种非常重要有用的table,名曰view(看似一张表,不存储数据),下面是一个定义view的schema

type: 'view' 定义FEMALE_EMPS为view
当sql语句长,可以用list标识

type: 'view' 定义FEMALE_EMPS为view
当sql语句长,可以用list标识
{
  version: '1.0',
  defaultSchema: 'SALES',
  schemas: [
    {
      name: 'SALES',
      type: 'custom',
      factory: 'org.apache.calcite.adapter.csv.CsvSchemaFactory',
      operand: {
        directory: 'target/test-classes/sales'
      },
      tables: [
        {
          name: 'FEMALE_EMPS',
          type: 'view',
          sql: 'SELECT * FROM emps WHERE gender = \'F\''
        }
      ]
    }
  ]
}
==========
{
  name: 'FEMALE_EMPS',
  type: 'view',
  sql: [
    'SELECT * FROM emps',
    'WHERE gender = \'F\''
  ]
}
 
sqlline> SELECT e.name, d.name FROM female_emps AS e JOIN depts AS d on e.deptno = d.deptno;
+--------+------------+
|  NAME  |    NAME    |
+--------+------------+
| Wilma  | Marketing  |
+--------+------------+


Custom tables

Custom tables是用户自定义代码实现的,而不是用 custom schema定义。

下面是一个以 model-with-custom-table.json标识的Custom tables

{
  version: '1.0',
  defaultSchema: 'CUSTOM_TABLE',
  schemas: [
    {
      name: 'CUSTOM_TABLE',
      tables: [
        {
          name: 'EMPS',
          type: 'custom',
          factory: 'org.apache.calcite.adapter.csv.CsvTableFactory',
          operand: {
            file: 'target/test-classes/sales/EMPS.csv.gz',
            flavor: "scannable"
          }
        }
      ]
    }
  ]
}
==
sqlline> !connect jdbc:calcite:model=target/test-classes/model-with-custom-table.json admin admin
sqlline> SELECT empno, name FROM custom_table.emps;
+--------+--------+
| EMPNO  |  NAME  |
+--------+--------+
| 100    | Fred   |
| 110    | Eric   |
| 110    | John   |
| 120    | Wilma  |
| 130    | Alice  |
+--------+--------+

这是一个常规的schema,包含一个custom table (powered by org.apache.calcite.adapter.csv.CsvTableFactory,实现TableFactory),create method 实例化一个CsvScannableTable,将model file传入

public CsvTable create(SchemaPlus schema, String name,
    Map<String, Object> map, RelDataType rowType) {
  String fileName = (String) map.get("file");
  final File file = new File(fileName);
  final RelProtoDataType protoRowType =
      rowType != null ? RelDataTypeImpl.proto(rowType) : null;
  return new CsvScannableTable(file, protoRowType);
}

Optimizing queries using planner rules

Planner rules 优化查询

sqlline> !connect jdbc:calcite:model=target/test-classes/model.json admin admin
sqlline> explain plan for select name from emps;
+-----------------------------------------------------+
| PLAN                                                |
+-----------------------------------------------------+
| EnumerableCalcRel(expr#0..9=[{inputs}], NAME=[$t1]) |
|   EnumerableTableScan(table=[[SALES, EMPS]])        |
+-----------------------------------------------------+
sqlline> !connect jdbc:calcite:model=target/test-classes/smart.json admin admin
sqlline> explain plan for select name from emps;
+-----------------------------------------------------+
| PLAN                                                |
+-----------------------------------------------------+
| EnumerableCalcRel(expr#0..9=[{inputs}], NAME=[$t1]) |
|   CsvTableScan(table=[[SALES, EMPS]])               |
+-----------------------------------------------------+

在 smart.json model 文件中多出如下一行:
flavor: "translatable"
这指示CsvSchema建立CsvTranslatableTable,而不是CsvScannableTable,CsvTranslatableTable实现TranslatableTable.toRel() method 建立CsvTableScan,它作为叶子(query operator tree)一般会实现EnumerableTableScan。
下面是entirety的rules,constructor中声明了触发规则的相关expressions,onMatch method 生成一个relational expression 调用RelOptRuleCall.transformTo()指示rule触发生效

public class CsvProjectTableScanRule extends RelOptRule {
  public static final CsvProjectTableScanRule INSTANCE =
      new CsvProjectTableScanRule();

  private CsvProjectTableScanRule() {
    super(
        operand(Project.class,
            operand(CsvTableScan.class, none())),
        "CsvProjectTableScanRule");
  }

  @Override
  public void onMatch(RelOptRuleCall call) {
    final Project project = call.rel(0);
    final CsvTableScan scan = call.rel(1);
    int[] fields = getProjectFields(project.getProjects());
    if (fields == null) {
      // Project contains expressions more complex than just field references.
      return;
    }
    call.transformTo(
        new CsvTableScan(
            scan.getCluster(),
            scan.getTable(),
            scan.csvTable,
            fields));
  }

  private int[] getProjectFields(List<RexNode> exps) {
    final int[] fields = new int[exps.size()];
    for (int i = 0; i < exps.size(); i++) {
      final RexNode exp = exps.get(i);
      if (exp instanceof RexInputRef) {
        fields[i] = ((RexInputRef) exp).getIndex();
      } else {
        return null; // not a simple projection
      }
    }
    return fields;
  }
}


JDBC adapter

JDBC adapter将JDBC data source 中的schema映射为Calcite schema.

例如 schema从MySQL “foodmart” database中读信息:

{
  version: '1.0',
  defaultSchema: 'FOODMART',
  schemas: [
    {
      name: 'FOODMART',
      type: 'custom',
      factory: 'org.apache.calcite.adapter.jdbc.JdbcSchema$Factory',
      operand: {
        jdbcDriver: 'com.mysql.jdbc.Driver',
        jdbcUrl: 'jdbc:mysql://localhost/foodmart',
        jdbcUser: 'foodmart',
        jdbcPassword: 'foodmart'
      }
    }
  ]
}

Current limitations:
JDBC adapter 只能下推table scan 操作,其他processing操作(filtering, joins, aggregations and so forth) 在Calcite中执行,我们的goal是尽可能下推processing

The cloning JDBC adapter

cloning JDBC adapter构建一个混合数据database,先从 JDBC database中读取数据,然后构建in-memory tables。后面的查询都是从in-memory tables中进行
例如:

{
  version: '1.0',
  defaultSchema: 'FOODMART_CLONE',
  schemas: [
    {
      name: 'FOODMART_CLONE',
      type: 'custom',
      factory: 'org.apache.calcite.adapter.clone.CloneSchema$Factory',
      operand: {
        jdbcDriver: 'com.mysql.jdbc.Driver',
        jdbcUrl: 'jdbc:mysql://localhost/foodmart',
        jdbcUser: 'foodmart',
        jdbcPassword: 'foodmart'
      }
    }
  ]
}

此外,以一个存在的schema构建一个clone schema,以source property 引用model中以定义的schema:

{
  version: '1.0',
  defaultSchema: 'FOODMART_CLONE',
  schemas: [
    {
      name: 'FOODMART',
      type: 'custom',
      factory: 'org.apache.calcite.adapter.jdbc.JdbcSchema$Factory',
      operand: {
        jdbcDriver: 'com.mysql.jdbc.Driver',
        jdbcUrl: 'jdbc:mysql://localhost/foodmart',
        jdbcUser: 'foodmart',
        jdbcPassword: 'foodmart'
      }
    },
    {
      name: 'FOODMART_CLONE',
      type: 'custom',
      factory: 'org.apache.calcite.adapter.clone.CloneSchema$Factory',
      operand: {
        source: 'FOODMART'
      }
    }
  ]
}

展开阅读全文

没有更多推荐了,返回首页