Calcite-[3]-Adapters

hjw199089

Original article: http://calcite.apache.org/docs/adapter.html

Schema adapters

A schema adapter allows Calcite to read a particular kind of data, presenting the data as tables within a schema.

Cassandra adapter (calcite-cassandra)
CSV adapter (example/csv)
Druid adapter (calcite-druid)
Elasticsearch adapter (calcite-elasticsearch2 and calcite-elasticsearch5)
File adapter (calcite-file)
JDBC adapter (part of calcite-core)
MongoDB adapter (calcite-mongodb)
OS adapter (calcite-os)
Pig adapter (calcite-pig)
Solr cloud adapter (solr-sql)
Spark adapter (calcite-spark)
Splunk adapter (calcite-splunk)
Eclipse Memory Analyzer (MAT) adapter (mat-calcite-plugin)

Drivers

A driver allows your application to connect to Calcite.

JDBC driver

The JDBC driver is powered by Avatica. Connections can be local or remote (JSON over HTTP or Protobuf over HTTP).

The basic form of the JDBC connect string is:

jdbc:calcite:property=value;property2=value2

JDBC connect string parameters

PROPERTY	DESCRIPTION
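approximateDecimal	Whether approximate results from aggregate functions on DECIMAL types are acceptable.
approximateDistinctCount	Whether approximate results from `COUNT(DISTINCT ...)` aggregate functions are acceptable.
approximateTopN	Whether approximate results from “Top N” queries (`ORDER BY aggFun() DESC LIMIT n`) are acceptable.
caseSensitive	Whether identifiers are matched case-sensitively.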
conformance	SQL conformance level. Values: DEFAULT (the default, similar to PRAGMATIC_2003), LENIENT, MYSQL_5, ORACLE_10, ORACLE_12, PRAGMATIC_99, PRAGMATIC_2003, STRICT_92, STRICT_99, STRICT_2003, SQL_SERVER_2008.
createMaterializations	Whether Calcite should create materializations. Default false.
defaultNullCollation	How NULL values should be sorted if neither NULLS FIRST nor NULLS LAST are specified in a query. The default, HIGH, sorts NULL values the same as Oracle.
druidFetch	How many rows the Druid adapter should fetch at a time when executing SELECT queries.
forceDecorrelate	Whether the planner should try de-correlating as much as possible. Default true.
fun	Collection of built-in functions and operators. Valid values: “standard” (the default), “oracle”.
lex	Lexical policy. Values are ORACLE (default), MYSQL, MYSQL_ANSI, SQL_SERVER, JAVA.
materializationsEnabled	Whether Calcite should use materializations. Default false.
model	URI of the JSON model file.
parserFactory	Parser factory. The name of a class that implements interface SqlParserImplFactory and has a public default constructor or an `INSTANCE` constant.
quoting	How identifiers are quoted. Values are DOUBLE_QUOTE, BACK_QUOTE, BRACKET. If not specified, value from `lex` is used.
quotedCasing	How identifiers are stored if they are quoted. Values are UNCHANGED, TO_UPPER, TO_LOWER. If not specified, value from `lex` is used.
schema	Name of initial schema.
schemaFactory	Schema factory. The name of a class that implements interface SchemaFactory and has a public default constructor or an `INSTANCE` constant. Ignored if `model` is specified.
schemaType	Schema type. Value must be “MAP” (the default), “JDBC”, or “CUSTOM” (implicit if `schemaFactory` is specified). Ignored if `model` is specified.
spark	Specifies whether Spark should be used as the engine for processing that cannot be pushed to the source system. If false (the default), Calcite generates code that implements the Enumerable interface.
timeZone	Time zone, for example “gmt-3”. Default is the JVM’s time zone.
typeSystem	Type system. The name of a class that implements interface RelDataTypeSystem and has a public default constructor or an `INSTANCE` constant.
unquotedCasing	How identifiers are stored if they are not quoted. Values are UNCHANGED, TO_UPPER, TO_LOWER. If not specified, value from `lex` is used.
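
To make the property=value;property2=value2 form concrete, here is a minimal Java sketch that splits a connect string into its properties. The class name ConnectStringSketch and the method parse are hypothetical names chosen for illustration. This is not the parser that Avatica uses, and it ignores any quoting or escaping rules the real driver may support.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: a simplified reading of "jdbc:calcite:k1=v1;k2=v2".
class ConnectStringSketch {
  static final String PREFIX = "jdbc:calcite:";

  /** Splits the part after the prefix on ';' and each pair on its first '='. */
  static Map<String, String> parse(String url) {
    if (!url.startsWith(PREFIX)) {
      throw new IllegalArgumentException("not a Calcite connect string: " + url);
    }
    Map<String, String> props = new LinkedHashMap<>();
    for (String pair : url.substring(PREFIX.length()).split(";")) {
      String trimmed = pair.trim();
      if (trimmed.isEmpty()) {
        continue; // tolerate an empty property list or a trailing ';'
      }
      int eq = trimmed.indexOf('=');
      if (eq < 0) {
        throw new IllegalArgumentException("missing '=' in: " + trimmed);
      }
      props.put(trimmed.substring(0, eq).trim(), trimmed.substring(eq + 1).trim());
    }
    return props;
  }

  public static void main(String[] args) {
    // Property names are taken from the table above.
    System.out.println(parse("jdbc:calcite:lex=JAVA; caseSensitive=false"));
  }
}
```

In an application you would normally pass the whole string to java.sql.DriverManager.getConnection and let the driver interpret it; the sketch only shows how the string is structured.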

To connect to a single schema based on a built-in schema type, you do not need a model file. For example:

jdbc:calcite:schemaType=JDBC; schema.jdbcUser=SCOTT; schema.jdbcPassword=TIGER; schema.jdbcUrl=jdbc:hsqldb:res:foodmart

creates a connection with a schema that is mapped, via the JDBC schema adapter, to the foodmart database.

You can also use a custom schema factory:

jdbc:calcite:schemaFactory=org.apache.calcite.adapter.cassandra.CassandraSchemaFactory; schema.host=localhost; schema.keyspace=twissandra

makes a connection to the Cassandra adapter. Note how each key in the operand section of a model appears with a schema. prefix in the connect string. The connect string is equivalent to writing the following model file:

{
  "version": "1.0",
  "defaultSchema": "twissandra",
  "schemas": [
    {
      "type": "custom",
      "name": "twissandra",
      "factory": "org.apache.calcite.adapter.cassandra.CassandraSchemaFactory",
      "operand": {
        "host": "localhost",
        "keyspace": "twissandra"
      }
    }
  ]
}
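
The mapping between operand keys and schema.-prefixed connect string properties can be sketched in a few lines of Java. SchemaOperandSketch and toConnectString are hypothetical names used only for this illustration; Calcite itself does not ship such a helper, as far as these notes are concerned.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: turns a schema factory plus operand map into a connect string.
class SchemaOperandSketch {
  /** Each operand key is emitted as a "schema."-prefixed property. */
  static String toConnectString(String schemaFactory, Map<String, String> operand) {
    StringBuilder url =
        new StringBuilder("jdbc:calcite:schemaFactory=").append(schemaFactory);
    for (Map.Entry<String, String> e : operand.entrySet()) {
      url.append(";schema.").append(e.getKey()).append('=').append(e.getValue());
    }
    return url.toString();
  }

  public static void main(String[] args) {
    Map<String, String> operand = new LinkedHashMap<>();
    operand.put("host", "localhost");
    operand.put("keyspace", "twissandra");
    System.out.println(toConnectString(
        "org.apache.calcite.adapter.cassandra.CassandraSchemaFactory", operand));
  }
}
```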

Server

Calcite’s core module (calcite-core) supports SQL queries (SELECT) and DML operations (INSERT, UPDATE, DELETE, MERGE) but does not support DDL operations such as CREATE SCHEMA or CREATE TABLE. DDL complicates the state model of the repository and makes the parser harder to extend.

The server module (calcite-server) adds DDL support:

CREATE and DROP SCHEMA
CREATE and DROP FOREIGN SCHEMA
CREATE and DROP TABLE (including CREATE TABLE ... AS SELECT)
CREATE and DROP MATERIALIZED VIEW
CREATE and DROP VIEW

The commands are described in the SQL reference.

To enable DDL, include calcite-server.jar in your class path and add parserFactory=org.apache.calcite.sql.parser.ddl.SqlDdlParserImpl#FACTORY to the JDBC connect string (see connect string property parserFactory).

$ ./sqlline
sqlline version 1.3.0
> !connect jdbc:calcite:parserFactory=org.apache.calcite.sql.parser.ddl.SqlDdlParserImpl#FACTORY sa ""
> CREATE TABLE t (i INTEGER, j VARCHAR(10));
No rows affected (0.293 seconds)
> INSERT INTO t VALUES (1, 'a'), (2, 'bc');
2 rows affected (0.873 seconds)
> CREATE VIEW v AS SELECT * FROM t WHERE i > 1;
No rows affected (0.072 seconds)
> SELECT count(*) FROM v;
+---------------------+
|       EXPR$0        |
+---------------------+
| 1                   |
+---------------------+
1 row selected (0.148 seconds)
> !quit

Extensibility

There are many other APIs that allow you to extend Calcite’s capabilities.

Functions and operators

You can add your own operators and functions to Calcite.

For needs more complex than a user-defined function (UDF) can meet, write a user-defined operator (see interface SqlOperator).

If your operator does not adhere to standard SQL function syntax, “f(arg1, arg2, ...)”, you need to extend the parser.

For examples of UDFs and UDAFs, see class UdfTest.

Aggregate functions

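A user-defined aggregate function (UDAF) has one method for each stage in the life cycle of an aggregate:

init creates an accumulator;
add adds one row’s value to an accumulator;
merge combines two accumulators into one;
result finalizes an accumulator and converts it to a result.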

Pseudo-code for SUM(int):

struct Accumulator {
  final int sum;
}
Accumulator init() {
  return new Accumulator(0);
}
Accumulator add(Accumulator a, int x) {
  return new Accumulator(a.sum + x);
}
Accumulator merge(Accumulator a, Accumulator a2) {
  return new Accumulator(a.sum + a2.sum);
}
int result(Accumulator a) {
  return a.sum;
}

Here is the sequence of calls to compute the sum of two rows with column values 4 and 7:

a = init()    # a = {0}
a = add(a, 4) # a = {4}
a = add(a, 7) # a = {11}
return result(a) # returns 11
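
The life cycle above can be written as runnable Java. This is a standalone sketch: the class name SumSketch is hypothetical, nothing here is registered with Calcite, and result simply returns the accumulated sum. It also exercises merge, which the call sequence above does not show.

```java
// Illustrative only: the init/add/merge/result life cycle for SUM(int).
class SumSketch {
  /** Immutable accumulator holding the running sum. */
  static final class Accumulator {
    final int sum;
    Accumulator(int sum) {
      this.sum = sum;
    }
  }

  static Accumulator init() {
    return new Accumulator(0);
  }

  static Accumulator add(Accumulator a, int x) {
    return new Accumulator(a.sum + x);
  }

  /** Combines two partial sums, e.g. computed over different partitions. */
  static Accumulator merge(Accumulator a, Accumulator a2) {
    return new Accumulator(a.sum + a2.sum);
  }

  static int result(Accumulator a) {
    return a.sum;
  }

  public static void main(String[] args) {
    Accumulator a = add(add(init(), 4), 7);
    System.out.println(result(a)); // sum of 4 and 7

    // Same rows split across two accumulators, then merged.
    Accumulator left = add(init(), 4);
    Accumulator right = add(init(), 7);
    System.out.println(result(merge(left, right)));
  }
}
```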

Window functions

A window function is similar to an aggregate function but it is applied to a set of rows gathered by an OVER clause rather than by a GROUP BY clause. Every aggregate function can be used as a window function, but there are some key differences. The rows seen by a window function may be ordered, and window functions that rely upon order (RANK, for example) cannot be used as aggregate functions.

Another difference is that windows are non-disjoint: a particular row can appear in more than one window. For example, 10:37 appears in both the 10:00-11:00 hour and the 10:15-11:15 hour.

Window functions are computed incrementally: when the clock ticks from 10:14 to 10:15, two rows might enter the window and three rows leave. For this, window functions have an extra life-cycle operation:

remove removes a value from an accumulator.

Its pseudo-code for SUM(int) would be:

Accumulator remove(Accumulator a, int x) {
  return new Accumulator(a.sum - x);
}

Here is the sequence of calls to compute the moving sum, over the previous 2 rows, of 4 rows with values 4, 7, 2 and 3:

a = init()       # a = {0}
a = add(a, 4)    # a = {4}
emit result(a)   # emits 4
a = add(a, 7)    # a = {11}
emit result(a)   # emits 11
a = remove(a, 4) # a = {7}
a = add(a, 2)    # a = {9}
emit result(a)   # emits 9
a = remove(a, 7) # a = {2}
a = add(a, 3)    # a = {5}
emit result(a)   # emits 5
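
The same call sequence can be reproduced with a small runnable Java sketch. MovingSumSketch and movingSum are hypothetical names; the code keeps the rows currently in the window in a queue and follows the remove-then-add order shown above. It is not how Calcite generates window code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Illustrative only: incremental moving SUM using add and remove.
class MovingSumSketch {
  /** Emits, for each value, the sum of the last windowSize values seen so far. */
  static List<Integer> movingSum(int[] values, int windowSize) {
    Deque<Integer> window = new ArrayDeque<>(); // rows currently in the window
    List<Integer> emitted = new ArrayList<>();
    int sum = 0; // init
    for (int x : values) {
      if (window.size() == windowSize) {
        sum -= window.removeFirst(); // remove: the oldest row leaves the window
      }
      window.addLast(x);
      sum += x; // add: the new row enters the window
      emitted.add(sum); // result
    }
    return emitted;
  }

  public static void main(String[] args) {
    System.out.println(movingSum(new int[] {4, 7, 2, 3}, 2));
  }
}
```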

Grouped window functions

Grouped window functions are functions that operate the GROUP BY clause to gather together records into sets. The built-in grouped window functions are HOP, TUMBLE and SESSION. You can define additional functions by implementing interface SqlGroupedWindowFunction.
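
To give a feel for how rows are gathered into sets, the following Java sketch computes which tumbling window and which hopping windows a timestamp falls into. It is plain arithmetic under stated assumptions: windows are half-open intervals [start, start + size), they are aligned to zero, and timestamps are plain numbers (minutes since midnight in the example). The names GroupWindowSketch, tumbleStart and hopStarts are hypothetical, and the alignment rules of Calcite's own TUMBLE and HOP may differ. SESSION windows depend on gaps between rows and are not covered.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: assigning a timestamp to tumbling and hopping windows.
class GroupWindowSketch {
  /** Start of the single tumbling window of the given size that contains ts. */
  static long tumbleStart(long ts, long size) {
    return ts - Math.floorMod(ts, size);
  }

  /**
   * Starts, in ascending order, of every hopping window [start, start + size)
   * that contains ts, where a new window starts every slide units.
   */
  static List<Long> hopStarts(long ts, long slide, long size) {
    List<Long> starts = new ArrayList<>();
    long latest = ts - Math.floorMod(ts, slide); // latest window start <= ts
    for (long s = latest; s > ts - size; s -= slide) {
      starts.add(0, s); // prepend so the list ends up ascending
    }
    return starts;
  }

  public static void main(String[] args) {
    long ts = 10 * 60 + 37; // 10:37 as minutes since midnight
    System.out.println(tumbleStart(ts, 60));   // start of the hourly window
    System.out.println(hopStarts(ts, 15, 60)); // hourly windows starting every 15 minutes
  }
}
```

With a slide smaller than the size, one row lands in several windows, which is the non-disjoint behavior of windows in general.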

Table functions and table macros

User-defined table functions are defined in a similar way to regular “scalar” user-defined functions, but are used in the FROM clause of a query. The following query uses a table function called Ramp:

SELECT * FROM TABLE(Ramp(3, 4))

User-defined table macros use the same SQL syntax as table functions, but are defined differently. Rather than generating data, they generate a relational expression. Table macros are invoked during query preparation and the relational expression they produce can then be optimized. (Calcite’s implementation of views uses table macros.)

class TableFunctionTest tests table functions and contains several useful examples.

Extending the parser

Suppose you need to extend Calcite’s SQL grammar in a way that will be compatible with future changes to the grammar. Making a copy of the grammar file Parser.jj in your project would be foolish, because the grammar is edited quite frequently.

Fortunately, Parser.jj is actually an Apache FreeMarker template that contains variables that can be substituted. The parser in calcite-core instantiates the template with default values of the variables, typically empty, but you can override them. If your project would like a different parser, you can provide your own config.fmpp and parserImpls.ftl files and therefore generate an extended parser.

The calcite-server module, which was created in [CALCITE-707] and adds DDL statements such as CREATE TABLE, is an example that you could follow. Also see class ExtensionSqlParserTest.

Customizing SQL dialect accepted and generated

To customize what SQL extensions the parser should accept, implement interface SqlConformance or use one of the built-in values in enum SqlConformanceEnum.

To control how SQL is generated for an external database (usually via the JDBC adapter), use class SqlDialect. The dialect also describes the engine’s capabilities, such as whether it supports OFFSET and FETCH clauses.

Defining a custom schema

To define a custom schema, you need to implement interface SchemaFactory.

During query preparation, Calcite will call this interface to find out what tables and sub-schemas your schema contains. When a table in your schema is referenced in a query, Calcite will ask your schema to create an instance of interface Table.

That table will be wrapped in a TableScan and will undergo the query optimization process.

Reflective schema

A reflective schema (class ReflectiveSchema) is a way of wrapping a Java object so that it appears as a schema. Its collection-valued fields will appear as tables.

It is not a schema factory but an actual schema; you have to create the object and wrap it in the schema by calling APIs.

See class ReflectiveSchemaTest.
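
The idea of "collection-valued fields appear as tables" can be illustrated with plain Java reflection. This is a conceptual sketch only, not the implementation of ReflectiveSchema: the names ReflectiveSketch, Hr and tableNames are hypothetical, and the rule used here (public fields that are arrays or java.util.Collection instances) is an assumption made for the example.

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Illustrative only: discovering "tables" from the fields of a Java object.
class ReflectiveSketch {
  /** Hypothetical schema object; each array or collection field plays the role of a table. */
  static class Hr {
    public final String[] emps = {"Bill", "Eric"};
    public final List<String> depts = new ArrayList<>();
    public final int version = 1; // not collection-valued, so not a table
  }

  /** Returns the names of public array-typed or Collection-typed fields, sorted. */
  static List<String> tableNames(Object schemaObject) {
    List<String> names = new ArrayList<>();
    for (Field f : schemaObject.getClass().getFields()) {
      Class<?> type = f.getType();
      if (type.isArray() || Collection.class.isAssignableFrom(type)) {
        names.add(f.getName());
      }
    }
    Collections.sort(names); // reflection does not guarantee field order
    return names;
  }

  public static void main(String[] args) {
    System.out.println(tableNames(new Hr()));
  }
}
```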

Defining a custom table

To define a custom table, you need to implement interface TableFactory. Whereas a schema factory produces a set of named tables, a table factory produces a single table when bound to a schema with a particular name (and optionally a set of extra operands).

Modifying data

If your table is to support DML operations (INSERT, UPDATE, DELETE, MERGE), your implementation of interface Table must implement interface ModifiableTable.

Streaming

If your table is to support streaming queries, your implementation of interface Table must implement interface StreamableTable.

See class StreamTest for examples.

Pushing operations down to your table

If you wish to push processing down to your custom table’s source system, consider implementing either interface FilterableTable or interface ProjectableFilterableTable.

If you want more control, you should write a planner rule. This will allow you to push down expressions, to make a cost-based decision about whether to push down processing, and push down more complex operations such as join, aggregation, and sort.
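
The division of labor in filter push-down can be sketched without any Calcite classes. In the Java example below, the table applies the filters it can handle and removes them from the list it was given, leaving the rest for the caller. This mirrors the general idea behind FilterableTable as understood in these notes (check its Javadoc for the exact contract). EqFilter, FilterPushDownSketch and the "column 0 is indexed" rule are all hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Illustrative only: a table that handles some filters and leaves the rest.
class FilterPushDownSketch {
  /** Hypothetical stand-in for a filter expression of the form "column = value". */
  static final class EqFilter {
    final int column;
    final Object value;
    EqFilter(int column, Object value) {
      this.column = column;
      this.value = value;
    }
  }

  /**
   * Scans the rows. Filters on column 0 (imagine the source has an index there)
   * are applied here and removed from the list; any filter left in the list
   * must still be applied by the caller.
   */
  static List<Object[]> scan(List<Object[]> rows, List<EqFilter> filters) {
    List<EqFilter> pushed = new ArrayList<>();
    for (Iterator<EqFilter> it = filters.iterator(); it.hasNext();) {
      EqFilter f = it.next();
      if (f.column == 0) {
        pushed.add(f);
        it.remove(); // signal to the caller that this filter is handled
      }
    }
    List<Object[]> out = new ArrayList<>();
    for (Object[] row : rows) {
      boolean keep = true;
      for (EqFilter f : pushed) {
        keep &= f.value.equals(row[f.column]);
      }
      if (keep) {
        out.add(row);
      }
    }
    return out;
  }
}
```

For a scan with filters on columns 0 and 1, the sketch returns the rows that match the column 0 filter and leaves the column 1 filter in the list.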

Type system

You can customize some aspects of the type system by implementing interface RelDataTypeSystem.

Relational operators

All relational operators implement interface RelNode and most extend class AbstractRelNode. The core operators (used by SqlToRelConverter and covering conventional relational algebra) are TableScan, TableModify, Values, Project, Filter, Aggregate, Join, Sort, Union, Intersect, Minus, Window and Match.

Each of these has a “pure” logical sub-class, LogicalProject and so forth. Any given adapter will have counterparts for the operations that its engine can implement efficiently; for example, the Cassandra adapter has CassandraProject but there is no CassandraJoin.

You can define your own sub-class of RelNode to add a new operator, or an implementation of an existing operator in a particular engine.

To make an operator useful and powerful, you will need planner rules to combine it with existing operators. (And also provide metadata, see below). This being algebra, the effects are combinatorial: you write a few rules, but they combine to handle an exponential number of query patterns.

If possible, make your operator a sub-class of an existing operator; then you may be able to re-use or adapt its rules. Even better, if your operator is a logical operation that you can rewrite (again, via a planner rule) in terms of existing operators, you should do that. You will be able to re-use the rules, metadata and implementations of those operators with no extra work.

Planner rule

A planner rule (class RelOptRule) transforms a relational expression into an equivalent relational expression.

A planner engine has many planner rules registered and fires them to transform the input query into something more efficient. Planner rules are therefore central to the optimization process, but surprisingly each planner rule does not concern itself with cost. The planner engine is responsible for firing rules in a sequence that produces an optimal plan, but each individual rule only concerns itself with correctness.

Calcite has two built-in planner engines: class VolcanoPlanner uses dynamic programming and is good for exhaustive search, whereas class HepPlanner fires a sequence of rules in a more fixed order.
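
The separation between rules (correctness) and the engine (firing order) can be shown with a toy example in plain Java. Everything here is hypothetical: Node is a miniature stand-in for a relational expression, MERGE_FILTERS is a single rewrite that merges two stacked filters, and optimize is a very simplified fixed-order driver, loosely in the spirit of HepPlanner but far less general. It uses none of Calcite's classes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Illustrative only: rules rewrite expressions; the driver decides when to fire them.
class RuleSketch {
  /** Hypothetical mini relational expression: a Scan leaf or a Filter over one input. */
  static final class Node {
    final String kind;   // "Scan" or "Filter"
    final String detail; // table name or filter condition
    final Node input;    // null for a Scan
    Node(String kind, String detail, Node input) {
      this.kind = kind;
      this.detail = detail;
      this.input = input;
    }
    @Override public String toString() {
      return input == null
          ? kind + "(" + detail + ")"
          : kind + "(" + detail + ", " + input + ")";
    }
  }

  /** A rule returns an equivalent expression if it matches; it knows nothing about cost. */
  interface Rule extends Function<Node, Optional<Node>> {}

  /** Rewrites Filter(c1, Filter(c2, x)) to Filter(c1 AND c2, x). */
  static final Rule MERGE_FILTERS = n ->
      "Filter".equals(n.kind) && n.input != null && "Filter".equals(n.input.kind)
          ? Optional.of(new Node("Filter", n.detail + " AND " + n.input.detail, n.input.input))
          : Optional.<Node>empty();

  /** Fires the rules at this node until none matches, then descends into the input. */
  static Node optimize(Node node, List<Rule> rules) {
    boolean fired = true;
    while (fired) {
      fired = false;
      for (Rule rule : rules) {
        Optional<Node> rewritten = rule.apply(node);
        if (rewritten.isPresent()) {
          node = rewritten.get();
          fired = true;
        }
      }
    }
    return node.input == null
        ? node
        : new Node(node.kind, node.detail, optimize(node.input, rules));
  }

  public static void main(String[] args) {
    Node plan = new Node("Filter", "a", new Node("Filter", "b", new Node("Scan", "t", null)));
    System.out.println(optimize(plan, Arrays.asList(MERGE_FILTERS)));
  }
}
```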

Calling conventions

A calling convention is a protocol used by a particular data engine. For example, the Cassandra engine has a collection of relational operators, CassandraProject, CassandraFilter and so forth, and these operators can be connected to each other without the data having to be converted from one format to another.

If data needs to be converted from one calling convention to another, Calcite uses a special sub-class of relational expression called a converter (see class Converter). But of course converting data has a runtime cost.

When planning a query that uses multiple engines, Calcite “colors” regions of the relational expression tree according to their calling convention. The planner pushes operations into data sources by firing rules. If the engine does not support a particular operation, the rule will not fire. Sometimes an operation can occur in more than one place, and ultimately the best plan is chosen according to cost.

A calling convention is a class that implements interface Convention, an auxiliary interface (for instance interface CassandraRel), and a set of sub-classes of class RelNode that implement that interface for the core relational operators (Project, Filter, Aggregate, and so forth).

Built-in SQL implementation

How does Calcite implement SQL, if an adapter does not implement all of the core relational operators?

The answer is a particular built-in calling convention, EnumerableConvention. Relational expressions of enumerable convention are implemented as “built-ins”: Calcite generates Java code, compiles it, and executes it inside its own JVM. Enumerable convention is less efficient than, say, a distributed engine running over column-oriented data files, but it can implement all core relational operators and all built-in SQL functions and operators. If a data source cannot implement a relational operator, enumerable convention is a fall-back.

Statistics and cost

Calcite has a metadata system that allows you to define cost functions and statistics about relational operators, collectively referred to as metadata. Each kind of metadata has an interface with (usually) one method. For example, selectivity is defined by interface RelMdSelectivity and the method getSelectivity(RelNode rel, RexNode predicate).

There are many built-in kinds of metadata, including collation, column origins, column uniqueness, distinct row count, distribution, explain visibility, expression lineage, max row count, node types, parallelism, percentage original rows, population size, predicates, row count, selectivity, size, table references, and unique keys; you can also define your own.

You can then supply a metadata provider that computes that kind of metadata for particular sub-classes of RelNode. Metadata providers can handle built-in and extended metadata types, and built-in and extended RelNode types. While preparing a query Calcite combines all of the applicable metadata providers and maintains a cache so that a given piece of metadata (for example the selectivity of the condition x > 10 in a particular Filter operator) is computed only once.
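
The "computed only once" behavior can be sketched as a small memoizing wrapper in plain Java. This is a conceptual illustration, not Calcite's metadata machinery: MetadataCacheSketch, the string keys standing in for a RelNode and a predicate, and the providerCalls counter are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative only: caching one kind of metadata (selectivity) per node and predicate.
class MetadataCacheSketch {
  private final Map<String, Double> cache = new HashMap<>();
  private final Function<String, Double> provider; // computes selectivity for a predicate
  int providerCalls = 0; // counts how often the provider actually runs

  MetadataCacheSketch(Function<String, Double> provider) {
    this.provider = provider;
  }

  /** Returns the cached selectivity, asking the provider only on the first request. */
  double getSelectivity(String nodeId, String predicate) {
    String key = nodeId + "|" + predicate;
    return cache.computeIfAbsent(key, k -> {
      providerCalls++;
      return provider.apply(predicate);
    });
  }

  public static void main(String[] args) {
    // Hypothetical provider that guesses 25% selectivity for every predicate.
    MetadataCacheSketch metadata = new MetadataCacheSketch(p -> 0.25);
    metadata.getSelectivity("Filter#1", "x > 10");
    metadata.getSelectivity("Filter#1", "x > 10");
    System.out.println(metadata.providerCalls); // the provider ran once
  }
}
```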