java impala shell 命令_impala-shell 命令参考(翻译)

SET 命令可用的查询选项

你可以在 impala-shell 会话中执行以下选项,并且本回话之后的所有查询都会生效。

ABORT_ON_DEFAULT_LIMIT_EXCEEDED

与 DEFAULT_ORDER_BY_LIMIT 联合使用,以确保 ORDER BY 查询的结果集不会因意外截断。假如没有使用 LIMIT 子句的 ORDER BY 查询结果集达到了 DEFAULT_ORDER_BY_LIMIT 选项的值,查询被取消而不是返回不正确的结果集。

Type: Boolean

Default: false (使用 SET 命令时显示为 0)

影响 ORDER BY 和 LIMIT 子句

当使用 LIMIT 子句时,DEFAULT_ORDER_BY_LIMIT 和 ABORT_ON_DEFAULT_LIMIT_EXCEEDED 不会生效;

[localhost:21000] > select x from three_rows order by x limit 5;

Query: select x from three_rows order by x limit 5

Query finished, fetching results ...

+---+

| x |

+---+

| 1 |

| 2 |

| 3 |

+---+

Returned 3 row(s) in 0.27s

只设置 DEFAULT_ORDER_BY_LIMIT 时,结果集被截断:

[localhost:21000] > set default_order_by_limit=5;

DEFAULT_ORDER_BY_LIMIT set to 5

[localhost:21000] > select x from ten_rows order by x;

Query: select x from ten_rows order by x

Query finished, fetching results ...

+---+

| x |

+---+

| 1 |

| 2 |

| 3 |

| 4 |

| 5 |

+---+

Returned 5 row(s) in 0.30s

当指定 ABORT_ON_DEFAULT_LIMIT_EXCEEDED 时,查询被取消而不是结果集被截断:

[localhost:21000] > set abort_on_default_limit_exceeded=true;

ABORT_ON_DEFAULT_LIMIT_EXCEEDED set to true

[localhost:21000] > select x from ten_rows order by x;

Query: select x from ten_rows order by x

Query aborted, unable to fetch data

Backend 0:DEFAULT_ORDER_BY_LIMIT has been exceeded.

ABORT_ON_ERROR

启用本选项后,当任意节点发生错误时 Impala 都会立即取消查询,而不是继续并可能返回不完全的结果。本选项默认启用,以便错误发生时收集最多的诊断信息,例如,是在所有节点还是只有一个节点发生错误。

Type: BOOLEAN

Default: false (使用 SET 命令时显示为 0)

ALLOW_UNSUPPORTED_FORMATS

如果设置为 true,则 Impala 会尝试在没有全部支持的文件格式上尝试执行。

Type: BOOLEAN

Default: false (使用 SET 命令时显示为 0)

BATCH_SIZE

批处理的大小,后台使用,默认为未指定或0(Batch size to be used by backend. Unspecified or a size of 0 indicates backend default)

Default: 0

DEBUG_ACTION

仅在 Cloudera support 的指导下使用

Type: STRING

Default: empty string

DEFAULT_ORDER_BY_LIMIT

Impala 查询中使用 ORDER BY 子句时必须也包括 LIMIT 子句,来避免偶然产生必须排序的海量数据集(排序一个海量数据集是内存密集的操作,直到排序结束才有结果返回,看起来查询就像没有响应一样)

为了避免修改你的源码,在每一个使用 ORDER BY 的查询中添加 LIMIT 子句,你可以设置 DEFAULT_ORDER_BY_LIMIT 查询选项为你的 ORDER BY 查询可能或期望返回最大行数。例如,如果你只关心 top 10 的结果,在执行查询前可以立即设置 DEFAULT_ORDER_BY_LIMIT=10。或者你设置 DEFAULT_ORDER_BY_LIMIT=1000000 作为合理性检测,来确认没有 ORDER BY 查询返回超出 100W 行;查询结果少于 100W 行的不受影响。

默认值 -1 表示结果集大小没有上限,这时候每一个 ORDER BY 查询都必须有 LIMIT 子句而不是通过查询选项来限制。

Default: -1 (no default limit)

影响 ORDER BY 和 LIMIT 子句

当指定了 LIMIT 子句时, DEFAULT_ORDER_BY_LIMIT 和 ABORT_ON_DEFAULT_LIMIT_EXCEEDED 不生效:

[localhost:21000] > select x from three_rows order by x limit 5;

Query: select x from three_rows order by x limit 5

Query finished, fetching results ...

+---+

| x |

+---+

| 1 |

| 2 |

| 3 |

+---+

Returned 3 row(s) in 0.27s

只指定 DEFAULT_ORDER_BY_LIMIT 时,结果集被截断:

[localhost:21000] > set default_order_by_limit=5;

DEFAULT_ORDER_BY_LIMIT set to 5

[localhost:21000] > select x from ten_rows order by x;

Query: select x from ten_rows order by x

Query finished, fetching results ...

+---+

| x |

+---+

| 1 |

| 2 |

| 3 |

| 4 |

| 5 |

+---+

Returned 5 row(s) in 0.30s

当指定 ABORT_ON_DEFAULT_LIMIT_EXCEEDED 时,查询被取消而不是结果集被截断:

[localhost:21000] > set abort_on_default_limit_exceeded=true;

ABORT_ON_DEFAULT_LIMIT_EXCEEDED set to true

[localhost:21000] > select x from ten_rows order by x;

Query: select x from ten_rows order by x

Query aborted, unable to fetch data

Backend 0:DEFAULT_ORDER_BY_LIMIT has been exceeded.

DISABLE_CODEGEN

这是个调试选项,用于诊断和确定导致崩溃的问题(This is a debug option, intended for diagnosing and working around issues that cause crashes)。假如查询失败返回 "非法指令(illegal instruction)" 或其他硬件特定的信息(hardware-specific message),尝试设置 DISABLE_CODEGEN=true 并重新执行查询。如果只有在开启 DISABLE_CODEGEN 选项的情况下查询才执行成功,问题和详细的问题报告提交给 Cloudera support。不建议开启此选项,会导致较低的整体性能。

Type: BOOLEAN

Default: false (使用 SET 命令时显示为 0)

EXPLAIN_LEVEL

设置为 verbose or 1 启用  EXPLAIN 命令中扩展信息的输出。设置为 normal or 0 取消扩展信息的输出。在性能调整中,当你需要确认表和列的统计信息对查询是否可用时,扩展信息特别有用。当使用 CDH 5 中的资源管理功能时,对于检查评估资源使用情况扩展信息同样有用。参见 EXPLAIN Statement 了解扩展信息的详细内容以及如何使用。

HBASE_CACHE_BLOCKS

设置此选项等价于调用 HBase Java 程序里的 org.apache.hadoop.hbase.client.Scan 类的setCacheBlocks 方法。和 HBASE_CACHING 一起使用,有助于控制 HBase region server 的内存压力。参见 Performance Considerations for the Impala-HBase Integration 了解详细信息。

Type: BOOLEAN

Default: false (使用 SET 命令时显示为 0)

HBASE_CACHING

设置此选项等价于调用 HBase Java 程序里的org.apache.hadoop.hbase.client.Scan类的 setCaching 方法。和HBASE_CACHE_BLOCKS 一起使用,有助于控制HBase region server 的内存压力。参见Performance Considerations for the Impala-HBase Integration了解详细信息

Type: BOOLEAN

Default: 0

MAX_ERRORS

错误被报告的最大错误数,默认为未指定或 0(Maximum number of errors to be reported. Unspecified or 0 indicates backend default)

Default: 0

MAX_IO_BUFFERS

每个硬盘的最大 I/O 缓冲区(Maximum number of I/O buffers (per disk))

Default: 0

MAX_SCAN_RANGE_LENGTH

扫描范围的最大长度,只使用于 HDFS 的扫描范围,默认为未指定或 0(Maximum length of the scan range. Only applicable to HDFS scan range. Unspecified or 0 indicates backend default)

Default: 0

MEM_LIMIT

没有启用资源管理时,定义每一节点查询可以分配的最大内存数量。假如任意节点上查询执行达到了指定的内存限制, Impala 会自动取消查询。当查询执行时内存限制会定期检查,因此在查询取消前实际内存使用可能会略略超过内存限制(When resource management is not enabled, defines the maximum amount of memory a query can allocate on each node. If query processing exceeds the specified memory limit on any node, Impala cancels the query automatically. Memory limits are checked periodically during query processing, so the actual memory in use might briefly exceed the limit without the query being cancelled)

在 CDH 5 中启用资源管理时,这一选项的机制发生了变化。如何设置了,将覆盖 Impala 中的自动内存估算。Impala 向每一节点上的 YARN 请求这些内存,直到内存可用查询才被允许。实际查询使用的内存可能更低,因为有些查询使用内存更少。启用资源管理,MEM_LIMIT 实际作为一个查询在每一节点上使用内存的硬件限制(由 YARN 来强制和确保当查询执行时每一节点上有这么多内存可用)。当启用资源管理而没有设置 MEM_LIMIT 时,Impala 估算每一查询在每一节点上所需的内存,再查询开始前向 YARN 请求这些内存,然后内部设置每一节点的 MEM_LIMIT 为请求的内存。因此,当查询需要比预估的更多的内存时,Impala 检测达到 MEM_LIMIT 限制,并取消查询本身(When resource management is enabled in CDH 5, the mechanism for this option changes. If set, it overrides the automatic memory estimate from Impala. Impala requests this amount of memory from YARN on each node, and the query does not proceed until that much memory is available. The actual memory used by the query could be lower, since some queries use much less memory than others. With resource management, the MEM_LIMIT setting acts both as a hard limit on the amount of memory a query can use on any node (enforced by YARN and a guarantee that that much memory will be available on each node while the query is being executed). When resource management is enabled but no MEM_LIMIT setting is specified, Impala estimates the amount of memory needed on each node for each query, requests that much memory from YARN before starting the query, and then internally sets the MEM_LIMIT on each node to the requested amount of memory during the query. Thus, if the query takes more memory than was originally estimated, Impala detects that the MEM_LIMIT is exceeded and cancels the query itself)

Default: 0

NUM_NODES

调试时使用,限制执行查询的节点数量

Default: 0

NUM_SCANNER_THREADS

扫描器线程数量(Number of scanner threads)

Default: 0

PARQUET_COMPRESSION_CODEC

当 Impala 使用 INSERT 语句写入 Parquet 数据文件时,相关的压缩选项由 PARQUET_COMPRESSION_CODEC 查询选项控制。此查询选项允许的值包括 snappy (默认), gzip, 和 none。选项值不区分大小写。参见 Snappy and GZip Compression for Parquet Data Files 了解详细信息和例子。

如何此选项设置为未验证的值,那么所有的查询都因为无效的值而失败,不仅仅是涉及到 Parquet 表的查询(If the option is set to an unrecognized value, all kinds of queries will fail due to the invalid option setting, not just queries involving Parquet tables)。

Default: SNAPPY

PARQUET_FILE_SIZE

设置 Impala INSERT 语句产生的 Parquet 数据文件的最大大小

Default: 0 (产生的数据文件最大 1 G)

RESERVATION_REQUEST_TIMEOUT (CDH5 Only)

Impala 预约完成或拒绝等待的最大毫秒数,在 Impala 1.2 和 CDH 5以上版本中与 Impala 资源管理功能联合使用(Maximum number of milliseconds Impala will wait for a reservation to be completely granted or denied. Used in conjunction with the Impala resource management feature in Impala 1.2 and higher with CDH 5)

Default: 300000 (5 minutes)

SUPPORT_START_OVER

Leave this setting false.

Default: false

SYNC_DDL

When enabled, causes any DDL operation such as CREATE TABLE or ALTER TABLE to return only when the changes have been propagated to all other Impala nodes in the cluster by the Impala catalog service. That way, if you issue a subsequent CONNECT statement in impala-shell to connect to a different node in the cluster, you can be sure that other node will already recognize any added or changed tables. (The catalog service automatically broadcasts the DDL changes to all nodes automatically, but without this option there could be a period of inconsistency if you quickly switched to another node.)

Although INSERT is classified as a DML statement, when the SYNC_DDL option is enabled, INSERT statements also delay their completion until all the underlying data and metadata changes are propagated to all Impala nodes. Internally, Impala inserts have similarities with DDL statements in traditional database systems, because they create metadata needed to track HDFS block locations for new files and they potentially add new partitions to partitioned tables.

66439fb730c6d254bc472f3327a5db0a.png  Note: Because this option can introduce a delay after each write operation, if you are running a sequence of CREATE DATABASE, CREATE TABLE, ALTER TABLE, INSERT, and similar statements within a setup script, to minimize the overall delay you can enable the SYNC_DDL query option only near the end, before the final DDL statement.

Default: false

V_CPU_CORES (CDH5 Only)

The number of per-host virtual CPU cores to request from YARN. If set, the query option overrides the automatic estimate from Impala. Used in conjunction with the Impala resource management feature in Impala 1.2 and higher and CDH 5.

Default: 0 (use automatic estimates)

YARN_POOL (CDH5 Only)

The YARN pool/queue name that queries should be submitted to. Used in conjunction with the Impala resource management feature in Impala 1.2 and higher and CDH 5. Specifies the name of the pool used by resource requests from Impala to the YARN resource management framework.

Default: empty (use the user-to-pool mapping defined by an impalad startup option in the Impala configuration file)

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值