其他参考:
http://blog.csdn.net/yu616568/article/details/49915577
http://www.infoq.com/cn/articles/new-big-data-hadoop-query-engine-apache-calcite
原文
http://calcite.apache.org/docs/
Apache Calcite 动态数据管理框架、典型的数据库管理系统(不包含数据存储、处理数据的复杂算法、存储数据的仓库)
意在承担存储、处理数据的任务
是应用和多种数据存储落地、数据处理引擎间的优秀中间媒介。
一步步演示如何通过一个简单的adapter将CSV文件目录映射成一个包含tables的schema,然后提供SQL查询接口。
concepts
- user-defined schema using SchemaFactory and Schema interfaces;
- declaring schemas in a model JSON file;
- declaring views in a model JSON file;
- user-defined table using the Table interface;
- determining the record type of a table;
- a simple implementation of Table, using theScannableTable interface, that enumerates all rows directly;
- a more advanced implementation that implements FilterableTable, and can filter out rows according to simple predicates;
- advanced implementation of Table, using TranslatableTable, that translates to relational operators using planner rules.
Download and build
$ git clone https://github.com/apache/calcite.git
$ cd calcite
$ mvn install -DskipTests -Dcheckstyle.skip=true
$ cd example/csv
通过工程自带的 SQL shell sqlline连接Calcite
$ ./sqlline
sqlline> !connect jdbc:calcite:model=target/test-classes/model.json admin admin
执行 metadata query(类似的命令还有!columns
和 !describe
):
sqlline> !tables
+------------+--------------+-------------+---------------+----------+------+
| TABLE_CAT | TABLE_SCHEM | TABLE_NAME | TABLE_TYPE | REMARKS | TYPE |
+------------+--------------+-------------+---------------+----------+------+
| null | SALES | DEPTS | TABLE | null | null |
| null | SALES | EMPS | TABLE | null | null |
| null | SALES | HOBBIES | TABLE | null | null |
| null | metadata | COLUMNS | SYSTEM_TABLE | null | null |
| null | metadata | TABLES | SYSTEM_TABLE | null | null |
+------------+--------------+-------------+---------------+----------+------+
其中
EMPS
DEPTS
tables 是基于 target/test-classes目录下的
EMPS.csv
和 DEPTS.csv
【1】体会下扫表查询
sqlline> SELECT * FROM emps;
【2】Now JOIN and GROUP BY:
sqlline> SELECT d.name, COUNT(*)
. . . .> FROM emps AS e JOIN depts AS d ON e.deptno = d.deptno
. . . .> GROUP BY d.name;
【3】VALUES operator ,expressions and SQL built-in functions:
sqlline> VALUES CHAR_LENGTH('Hello, ' || 'world!');
+---------+
| EXPR$0 |
+---------+
| 13 |
+---------+
Schema discovery
作为一个 “database without a storage layer”,Calcite不感知任何文件格式,Calcite是通过我们提供的calcite-example-csv project知道这些tables的。
主要步骤:
1.根据model 文件中的一个 schema factory class 定义一个schema
2. schema factory 构建一个schema,schema创建一些表,每个table知道怎么扫描 CSV file得到数据
3.Calcite解析query,planned 这个query来使用这些表,执行query Calcite invokes 这些表去读取数据
通过JDBC连接串我们得到modelJSON格式的path,下面是这个model:
{ version: '1.0', defaultSchema: 'SALES', schemas: [ { name: 'SALES', type: 'custom', factory: 'org.apache.calcite.adapter.csv.CsvSchemaFactory', operand: { directory: 'target/test-classes/sales' } } ] } |
上模型定义了一个single schema “SALES”,这个schema powered by 这个plugin class(org.apache.calcite.adapter.csv.CsvSchemaFactory,这是 calcite-example-csv project的一部分),CsvSchemaFactory实现了SchemaFactory,它的create 方法传入mode文件中的
directory参数构建一个schema实例
public Schema create(SchemaPlus parentSchema, String name, Map<String, Object> operand) { String directory = (String) operand.get("directory"); String flavorName = (String) operand.get("flavor"); CsvTable.Flavor flavor; if (flavorName == null) { flavor = CsvTable.Flavor.SCANNABLE; } else { flavor = CsvTable.Flavor.valueOf(flavorName.toUpperCase()); } return new CsvSchema( new File(directory), flavor); } |
继承自model ,schema factory 实例化一个schema (‘SALES’),一个实现了interface Schema的 org.apache.calcite.adapter.csv.CsvSchema
schema 主要用于产生一系列的tables(还包括sub-schemas, table-functions等高级特性),这些tables 实现了Table interface,上例中CsvSchema产生CsvTable及其子类。
CsvSchema
, 重写getTableMap()建表。CsvSchema扫描整个目录directory,找到 “.csv”结尾的文件,如
EMPS.csv
and DEPTS.csv,构建 tables
EMPS
and DEPTS
.
Tables and views in schemas
model中我们不需要定义任何表,schema会自动的构建这些表
同时,除了动的构建的表,可通过schema中的tables属性定义额外的表
例如,下面展示如何构建一种非常重要有用的table,名曰view(看似一张表,不存储数据),下面是一个定义view的schema
type: 'view' 定义FEMALE_EMPS为view
当sql语句长,可以用list标识
type: 'view' 定义FEMALE_EMPS为view 当sql语句长,可以用list标识 { version: '1.0', defaultSchema: 'SALES', schemas: [ { name: 'SALES', type: 'custom', factory: 'org.apache.calcite.adapter.csv.CsvSchemaFactory', operand: { directory: 'target/test-classes/sales' }, tables: [ { name: 'FEMALE_EMPS', type: 'view', sql: 'SELECT * FROM emps WHERE gender = \'F\'' } ] } ] } ========== { name: 'FEMALE_EMPS', type: 'view', sql: [ 'SELECT * FROM emps', 'WHERE gender = \'F\'' ] } sqlline> SELECT e.name, d.name FROM female_emps AS e JOIN depts AS d on e.deptno = d.deptno; +--------+------------+ | NAME | NAME | +--------+------------+ | Wilma | Marketing | +--------+------------+ |
Custom tables
Custom tables是用户自定义代码实现的,而不是用 custom schema定义。
下面是一个以 model-with-custom-table.json标识的Custom tables
{ version: '1.0', defaultSchema: 'CUSTOM_TABLE', schemas: [ { name: 'CUSTOM_TABLE', tables: [ { name: 'EMPS', type: 'custom', factory: 'org.apache.calcite.adapter.csv.CsvTableFactory', operand: { file: 'target/test-classes/sales/EMPS.csv.gz', flavor: "scannable" } } ] } ] } == sqlline> !connect jdbc:calcite:model=target/test-classes/model-with-custom-table.json admin admin sqlline> SELECT empno, name FROM custom_table.emps; +--------+--------+ | EMPNO | NAME | +--------+--------+ | 100 | Fred | | 110 | Eric | | 110 | John | | 120 | Wilma | | 130 | Alice | +--------+--------+ |
这是一个常规的schema,包含一个custom table (powered by org.apache.calcite.adapter.csv.CsvTableFactory,实现TableFactory),create method 实例化一个CsvScannableTable,将model file传入
public CsvTable create(SchemaPlus schema, String name, Map<String, Object> map, RelDataType rowType) { String fileName = (String) map.get("file"); final File file = new File(fileName); final RelProtoDataType protoRowType = rowType != null ? RelDataTypeImpl.proto(rowType) : null; return new CsvScannableTable(file, protoRowType); } |
Optimizing queries using planner rules
Planner rules 优化查询
sqlline> !connect jdbc:calcite:model=target/test-classes/model.json admin admin sqlline> explain plan for select name from emps; +-----------------------------------------------------+ | PLAN | +-----------------------------------------------------+ | EnumerableCalcRel(expr#0..9=[{inputs}], NAME=[$t1]) | | EnumerableTableScan(table=[[SALES, EMPS]]) | +-----------------------------------------------------+ sqlline> !connect jdbc:calcite:model=target/test-classes/smart.json admin admin sqlline> explain plan for select name from emps; +-----------------------------------------------------+ | PLAN | +-----------------------------------------------------+ | EnumerableCalcRel(expr#0..9=[{inputs}], NAME=[$t1]) | | CsvTableScan(table=[[SALES, EMPS]]) | +-----------------------------------------------------+ |
在 smart.json model 文件中多出如下一行:
flavor: "translatable"
这指示CsvSchema建立CsvTranslatableTable,而不是CsvScannableTable,CsvTranslatableTable实现TranslatableTable.toRel() method 建立CsvTableScan,它作为叶子(query operator tree)一般会实现EnumerableTableScan。
下面是entirety的rules,constructor中声明了触发规则的相关expressions,onMatch method 生成一个relational expression 调用RelOptRuleCall.transformTo()指示rule触发生效
public class CsvProjectTableScanRule extends RelOptRule { public static final CsvProjectTableScanRule INSTANCE = new CsvProjectTableScanRule(); private CsvProjectTableScanRule() { super( operand(Project.class, operand(CsvTableScan.class, none())), "CsvProjectTableScanRule"); } @Override public void onMatch(RelOptRuleCall call) { final Project project = call.rel(0); final CsvTableScan scan = call.rel(1); int[] fields = getProjectFields(project.getProjects()); if (fields == null) { // Project contains expressions more complex than just field references. return; } call.transformTo( new CsvTableScan( scan.getCluster(), scan.getTable(), scan.csvTable, fields)); } private int[] getProjectFields(List<RexNode> exps) { final int[] fields = new int[exps.size()]; for (int i = 0; i < exps.size(); i++) { final RexNode exp = exps.get(i); if (exp instanceof RexInputRef) { fields[i] = ((RexInputRef) exp).getIndex(); } else { return null; // not a simple projection } } return fields; } } |
JDBC adapter
JDBC adapter将JDBC data source 中的schema映射为Calcite schema.
例如 schema从MySQL “foodmart” database中读信息:
{ version: '1.0', defaultSchema: 'FOODMART', schemas: [ { name: 'FOODMART', type: 'custom', factory: 'org.apache.calcite.adapter.jdbc.JdbcSchema$Factory', operand: { jdbcDriver: 'com.mysql.jdbc.Driver', jdbcUrl: 'jdbc:mysql://localhost/foodmart', jdbcUser: 'foodmart', jdbcPassword: 'foodmart' } } ] } |
Current limitations:
JDBC adapter 只能下推table scan 操作,其他processing操作(filtering, joins, aggregations and so forth) 在Calcite中执行,我们的goal是尽可能下推processing
The cloning JDBC adapter
cloning JDBC adapter构建一个混合数据database,先从 JDBC database中读取数据,然后构建in-memory tables。后面的查询都是从in-memory tables中进行
例如:
{ version: '1.0', defaultSchema: 'FOODMART_CLONE', schemas: [ { name: 'FOODMART_CLONE', type: 'custom', factory: 'org.apache.calcite.adapter.clone.CloneSchema$Factory', operand: { jdbcDriver: 'com.mysql.jdbc.Driver', jdbcUrl: 'jdbc:mysql://localhost/foodmart', jdbcUser: 'foodmart', jdbcPassword: 'foodmart' } } ] } |
此外,以一个存在的schema构建一个clone schema,以source property 引用model中以定义的schema:
{ version: '1.0', defaultSchema: 'FOODMART_CLONE', schemas: [ { name: 'FOODMART', type: 'custom', factory: 'org.apache.calcite.adapter.jdbc.JdbcSchema$Factory', operand: { jdbcDriver: 'com.mysql.jdbc.Driver', jdbcUrl: 'jdbc:mysql://localhost/foodmart', jdbcUser: 'foodmart', jdbcPassword: 'foodmart' } }, { name: 'FOODMART_CLONE', type: 'custom', factory: 'org.apache.calcite.adapter.clone.CloneSchema$Factory', operand: { source: 'FOODMART' } } ] } |