Impala Concepts and Architecture (Part 2): Developing Impala Applications (Translated from English)

Developing Impala Applications

The core development language with Impala is SQL. You can also use Java or other languages to interact with Impala through the standard JDBC and ODBC interfaces used by many business intelligence tools. For specialized kinds of analysis, you can supplement the SQL built-in functions by writing user-defined functions (UDFs) in C++ or Java.
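
As a minimal sketch of the UDF mechanism described above (the shared-library path, symbol name, and table are hypothetical, not taken from the original text), a compiled C++ UDF is registered once and then called like any built-in function:

  -- Register a native (C++) UDF from a compiled shared library stored on HDFS.
  CREATE FUNCTION fuzzy_match(STRING, STRING) RETURNS BOOLEAN
  LOCATION '/user/impala/udfs/libfuzzy.so'
  SYMBOL='FuzzyMatch';

  -- Call it in a query just like a built-in function.
  SELECT name FROM customers WHERE fuzzy_match(name, 'Smith');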


Parent topic: Impala Concepts and Architecture

Overview of the Impala SQL Dialect

The Impala SQL dialect is highly compatible with the SQL syntax used in the Apache Hive component (HiveQL). As such, it is familiar to users who are already accustomed to running SQL queries on the Hadoop infrastructure. Currently, Impala SQL supports a subset of HiveQL statements, data types, and built-in functions. Impala also includes additional built-in functions for common industry features, to simplify porting SQL from non-Hadoop systems.


For users coming to Impala from traditional database or data warehousing backgrounds, the following aspects of the SQL dialect might seem familiar:

  • The SELECT statement includes familiar clauses such as WHERE, GROUP BY, ORDER BY, and WITH. You will find familiar notions such as joins, built-in functions for processing strings, numbers, and dates, aggregate functions, subqueries, and comparison operators such as IN() and BETWEEN. The SELECT statement is the place where SQL standards compliance is most important.

  • From the data warehousing world, you will recognize the notion of partitioned tables. One or more columns serve as partition keys, and the data is physically arranged so that queries that refer to the partition key columns in the WHERE clause can skip partitions that do not match the filter conditions. For example, if you have 10 years' worth of data and use a clause such as WHERE year = 2015, WHERE year > 2010, or WHERE year IN (2014, 2015), Impala skips all the data for non-matching years, greatly reducing the amount of I/O for the query (see the sketch after this list).

  • In Impala 1.2 and higher, UDFs let you perform custom comparisons and transformation logic during SELECT and INSERT...SELECT statements.
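
To make these points concrete, the following is a minimal sketch (the table and column names are hypothetical) of a partitioned table and a query whose WHERE clause lets Impala skip non-matching year partitions, combined with the WITH, GROUP BY, and ORDER BY clauses and a few built-in functions:

  -- Hypothetical sales table partitioned by year.
  CREATE TABLE sales (
    order_id BIGINT,
    customer STRING,
    amount   DOUBLE,
    order_ts TIMESTAMP
  )
  PARTITIONED BY (year INT)
  STORED AS PARQUET;

  -- Only the year=2014 and year=2015 partitions are scanned;
  -- data for all other years is skipped entirely.
  WITH recent AS (
    SELECT customer, amount, order_ts
    FROM sales
    WHERE year IN (2014, 2015)
  )
  SELECT customer,
         COUNT(*)               AS orders,
         SUM(amount)            AS total_amount,
         MIN(to_date(order_ts)) AS first_order_day
  FROM recent
  GROUP BY customer
  ORDER BY total_amount DESC
  LIMIT 10;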


For users coming to Impala from traditional database or data warehousing backgrounds, the following aspects of the SQL dialect might require some learning and practice for you to become proficient in the Hadoop environment:

  • Impala SQL is focused on queries and includes relatively little DML. There is no UPDATE or DELETE statement. Stale data is typically discarded (by DROP TABLE or ALTER TABLE ... DROP PARTITION statements) or replaced (by INSERT OVERWRITE statements).

  • All data creation is done by INSERT statements, which typically insert data in bulk by querying from other tables. There are two variations, INSERT INTO which appends to the existing data, and INSERT OVERWRITE which replaces the entire contents of a table or partition (similar to TRUNCATE TABLE followed by a new INSERT). Although there is an INSERT ... VALUES syntax to create a small number of values in a single statement, it is far more efficient to use the INSERT ... SELECT to copy and transform large amounts of data from one table to another in a single operation.

  • You often construct Impala table definitions and data files in some other environment, and then attach Impala so that it can run real-time queries. The same data files and table metadata are shared with other components of the Hadoop ecosystem. In particular, Impala can access tables created by Hive or data inserted by Hive, and Hive can access tables and data produced by Impala. Many other Hadoop components can write files in formats such as Parquet and Avro, that can then be queried by Impala.

  • Because Hadoop and Impala are focused on data warehouse-style operations on large data sets, Impala SQL includes some idioms that you might find in the import utilities for traditional database systems. For example, you can create a table that reads comma-separated or tab-separated text files, specifying the separator in the CREATE TABLE statement. You can create external tables that read existing data files but do not move or transform them (see the sketch after this list).

  • Because Impala reads large quantities of data that might not be perfectly tidy and predictable, it does not require length constraints on string data types. For example, you can define a database column as STRING with unlimited length, rather than CHAR(1) or VARCHAR(64). (Although in Impala 2.0 and later, you can also use length-constrained CHAR and VARCHAR types.)
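
As a sketch of the last few points (the table names, columns, and HDFS path are hypothetical), the following defines an external table over existing comma-separated text files, a partitioned Parquet table with unconstrained STRING columns, and a bulk INSERT OVERWRITE ... SELECT that replaces one partition in a single operation:

  -- External table over existing CSV files; Impala reads them in place
  -- and does not move or transform the files.
  CREATE EXTERNAL TABLE raw_events (
    event_id BIGINT,
    user_id  BIGINT,
    category STRING,
    amount   DOUBLE,
    year     INT
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION '/data/raw/events';

  -- Destination table; STRING is used rather than CHAR/VARCHAR.
  CREATE TABLE events (
    event_id BIGINT,
    user_id  BIGINT,
    category STRING,
    amount   DOUBLE
  )
  PARTITIONED BY (year INT)
  STORED AS PARQUET;

  -- Replaces the entire contents of the year=2015 partition;
  -- INSERT INTO would append instead.
  INSERT OVERWRITE events PARTITION (year = 2015)
  SELECT event_id, user_id, category, amount
  FROM raw_events
  WHERE year = 2015;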


Related information: Impala SQL Language Reference, especially Impala SQL Statements and Impala Built-In Functions

 

Overview of Impala Programming Interfaces

You can connect and submit requests to the Impala daemons through:

  • The impala-shell interactive command interpreter.
  • The Hue web-based user interface.
  • JDBC.
  • ODBC.

With these options, you can use Impala in heterogeneous environments, with JDBC or ODBC applications running on non-Linux platforms. You can also use Impala in combination with various Business Intelligence tools that use the JDBC and ODBC interfaces.

Each impalad daemon process, running on separate nodes in a cluster, listens to several ports for incoming requests. Requests from impala-shell and Hue are routed to the impalad daemons through the same port. The impalad daemons listen on separate ports for JDBC and ODBC requests.

