Impala Concepts and Architecture (3): How Impala Fits Into the Hadoop Ecosystem (translation of the English documentation)

https://impala.apache.org/docs/build/html/topics/impala_hadoop.html#intro_metastore

Impala makes use of many familiar components within the Hadoop ecosystem. Impala can interchange data with other Hadoop components, as both a consumer and a producer, so it can fit in flexible ways into your ETL and ELT pipelines.

How Impala Works with Hive

A major Impala goal is to make SQL-on-Hadoop operations fast and efficient enough to appeal to new categories of users and open up Hadoop to new types of use cases. Where practical, it makes use of existing Apache Hive infrastructure that many Hadoop users already have in place to perform long-running, batch-oriented SQL queries.

In particular, Impala keeps its table definitions in a traditional MySQL or PostgreSQL database known as the metastore, the same database where Hive keeps this type of data. Thus, Impala can access tables defined or loaded by Hive, as long as all columns use Impala-supported data types, file formats, and compression codecs.

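As a quick sketch of this interoperability (the table name and columns are invented for illustration), a table created through Hive can be queried from Impala once Impala picks up its metadata:

    -- In the Hive shell: the table definition goes into the shared metastore.
    CREATE TABLE web_logs (ip STRING, url STRING, ts TIMESTAMP);

    -- In impala-shell: load the metadata for the new table, then query it.
    INVALIDATE METADATA web_logs;
    SELECT COUNT(*) FROM web_logs;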

The initial focus on query features and performance means that Impala can read more types of data with the SELECT statement than it can write with the INSERT statement. To query data using the Avro, RCFile, or SequenceFile file formats, you load the data using Hive.

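For example (a sketch; the table names and layout are assumptions, not from the original), data in one of these formats is written through Hive and then queried through Impala:

    -- In the Hive shell: create an RCFile table and populate it,
    -- since Impala can read but not write this format.
    CREATE TABLE events_rc (event_id INT, payload STRING) STORED AS RCFILE;
    INSERT INTO TABLE events_rc SELECT event_id, payload FROM events_text;

    -- In impala-shell: pick up the new table, then query it.
    INVALIDATE METADATA events_rc;
    SELECT COUNT(*) FROM events_rc;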

The Impala query optimizer can also make use of table statistics and column statistics. Originally, you gathered this information with the ANALYZE TABLE statement in Hive; in Impala 1.2.2 and higher, use the Impala COMPUTE STATS statement instead. COMPUTE STATS requires less setup, is more reliable, and does not require switching back and forth between impala-shell and the Hive shell.

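A minimal sketch of the two approaches (the table name sales is hypothetical):

    -- Older approach, in the Hive shell:
    ANALYZE TABLE sales COMPUTE STATISTICS;
    ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS;

    -- In Impala 1.2.2 and higher, entirely within impala-shell:
    COMPUTE STATS sales;

    -- Optionally, inspect what the optimizer will see:
    SHOW TABLE STATS sales;
    SHOW COLUMN STATS sales;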

Overview of Impala Metadata and the Metastore

As discussed in How Impala Works with Hive, Impala maintains information about table definitions in a central database known as the metastore. Impala also tracks other metadata for the low-level characteristics of data files:

  • The physical locations of blocks within HDFS.

For tables with a large volume of data and/or many partitions, retrieving all the metadata for a table can be time-consuming, taking minutes in some cases. Thus, each Impala node caches all of this metadata to reuse for future queries against the same table.

If the table definition or the data in the table is updated, all other Impala daemons in the cluster must receive the latest metadata, replacing the obsolete cached metadata, before issuing a query against that table. In Impala 1.2 and higher, the metadata update is automatic, coordinated through the catalogd daemon, for all DDL and DML statements issued through Impala. See The Impala Catalog Service for details.

For DDL and DML issued through Hive, or changes made manually to files in HDFS, you still use the REFRESH statement (when new data files are added to existing tables) or the INVALIDATE METADATA statement (for entirely new tables, or after dropping a table, performing an HDFS rebalance operation, or deleting data files). Issuing INVALIDATE METADATA by itself retrieves metadata for all the tables tracked by the metastore. If you know that only specific tables have been changed outside of Impala, you can issue REFRESH table_name for each affected table to only retrieve the latest metadata for those tables.

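For instance (table names hypothetical):

    -- New data files were added to an existing table outside of Impala,
    -- for example through Hive or hdfs dfs -put: a targeted refresh suffices.
    REFRESH sales;

    -- A table was created or dropped outside of Impala: discard the cached
    -- metadata for that one table ...
    INVALIDATE METADATA new_hive_table;

    -- ... or, much more expensively, for every table in the metastore.
    INVALIDATE METADATA;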

How Impala Uses HDFS

Impala uses the distributed filesystem HDFS as its primary data storage medium. Impala relies on the redundancy provided by HDFS to guard against hardware or network outages on individual nodes. Impala table data is physically represented as data files in HDFS, using familiar HDFS file formats and compression codecs. When data files are present in the directory for a new table, Impala reads them all, regardless of file name. New data is added in files with names controlled by Impala.

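To illustrate (the path and table name are assumptions, not from the original), an external table can be mapped onto an existing HDFS directory, and Impala reads every file in it regardless of file name:

    -- In impala-shell: map a table onto a directory of existing data files.
    CREATE EXTERNAL TABLE raw_events (line STRING)
      LOCATION '/user/etl/raw_events';

    -- After copying more files into that directory (hdfs dfs -put ...),
    -- make them visible to subsequent queries:
    REFRESH raw_events;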

How Impala Uses HBase

HBase is an alternative to HDFS as a storage medium for Impala data. It is a database storage system built on top of HDFS, without built-in SQL support. Many Hadoop users already have it configured and store large (often sparse) data sets in it. By defining tables in Impala and mapping them to equivalent tables in HBase, you can query the contents of the HBase tables through Impala, and even perform join queries including both Impala and HBase tables. See Using Impala to Query HBase Tables for details.
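
As a rough sketch (table names and column mappings invented for illustration), the mapping is typically defined through Hive's HBase storage handler; Impala can then query the table and even join it against HDFS-backed tables:

    -- In the Hive shell: define a table backed by an existing HBase table.
    CREATE EXTERNAL TABLE hbase_users (id STRING, name STRING, email STRING)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:name,info:email")
    TBLPROPERTIES ("hbase.table.name" = "users");

    -- In impala-shell: pick up the definition, then join across storage layers.
    INVALIDATE METADATA hbase_users;
    SELECT o.order_id, u.name
    FROM orders o JOIN hbase_users u ON (o.user_id = u.id);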
